Consistency is a key property of all statistical procedures analyzing
randomly sampled data. Surprisingly, despite decades of work,
little is known about consistency of most clustering algorithms. In
this paper we investigate consistency of the popular family of spectral
clustering algorithms, which clusters the data with the help of
eigenvectors of graph Laplacian matrices. We develop new methods to
establish that, for increasing sample size, those eigenvectors converge
to the eigenvectors of certain limit operators. As a result, we can
prove that one of the two major classes of spectral clustering (normalized
clustering) converges under very general conditions, while
the other (unnormalized clustering) is only consistent under strong
additional assumptions, which are not always satisfied in real data.
We conclude that our analysis provides strong evidence for the superiority
of normalized spectral clustering.

1. Introduction. Clustering is a popular technique which is widely used
in statistics, computer science and various data analysis applications. Given
a set of data points, the goal is to separate the points in several groups based
on some notion of similarity. Very often it is a natural mathematical model to
assume that the data points have been drawn from an underlying probability
distribution. In this setting it is desirable that clustering algorithms should
satisfy certain basic consistency requirements:

• In the large sample limit, do the clusterings constructed by the given
algorithm "converge" to a clustering of the whole underlying space?

• If the clusterings do converge, is the limit clustering a reasonable partition
of the whole underlying space, and what are the properties of this limit
clustering?

Interestingly, while extensive literature exists on clustering and partitioning
(e.g., see Jain, Murty and Flynn [27] for a review), very few clustering al- 
gorithms have been analyzed or shown to converge in the setting where the 
data is sampled from an arbitrary probability distribution. In a parametric 
setting, clusters are often identified with the individual components of a mix- 
ture distribution. Then clustering reduces to standard parameter estimation, 
and of course there exist many results on the consistency of such estima- 
tors. However, in a nonparametric setting there are only two major classes 
of clustering algorithms where convergence questions have been studied at 
all: single linkage and k-means.

The k-means algorithm clusters a given set of points in $\mathbb{R}^d$ by constructing
k cluster centers such that the sum of squared distances of all data points
to their closest cluster centers is minimized (e.g., Section 14.3 of Hastie,
Tibshirani and Friedman [23]). Pollard [38] shows consistency of the global
minimizer of the objective function for k-means clustering. However, as the
k-means objective function is highly nonconvex, the problem of finding its
global minimum is often infeasible. As a consequence, the guarantees on 
the consistency of the minimizer are purely theoretical and do not apply 
to existing algorithms, which use local optimization techniques. The same 
problem also concerns all the follow-up articles on Pollard [38] by many 
different authors. 

Linkage algorithms construct a hierarchical clustering of a set of data 
points by starting with each point being a cluster, and then successively 
merging the two closest clusters (e.g., Section 14.3 of Hastie, Tibshirani 
and Friedman [23]). For this class of algorithms, Hartigan [22] demonstrates 
a weaker notion of consistency. He proves that the algorithm will identify 
certain high-density regions, but he does not obtain a general consistency 
result. 

In our opinion, the results about the consistency of clustering algorithms 
which can be found in the literature are far from satisfactory. This lack 
of consistency guarantees is especially striking as clustering algorithms are 
widely used in most scientific disciplines which deal with data in any form. 

In this paper we investigate the limit behavior of the class of spectral 
clustering algorithms. Spectral clustering is a popular technique going back 
to Donath and Hoffman [17] and Fiedler [19]. In its simplest form, it uses 
the second eigenvector of the graph Laplacian matrix constructed from the 
affinity graph between the sample points to obtain a partition of the samples 
into two groups. Different versions of spectral clustering have been used for 
many different problems such as load balancing (Van Driessche and Roose 
[46]), parallel computations (Hendrickson and Leland [24]), VLSI design 
(Hagen and Kahng [21]) and sparse matrix partitioning (Pothen, Simon and 
Liou [40]). Laplacian-based clustering algorithms also have found success 
in applications to image segmentation (Shi and Malik [43]), text mining
(Dhillon [15]) and as general purpose methods for data analysis and clustering
(Alpert [2], Kannan, Vempala and Vetta [28], Ding et al. [16], Ng,
Jordan and Weiss [36] and Belkin and Niyogi [10]). A nice survey on the
history of spectral clustering can be found in Spielman and Teng [44]; for a
tutorial introduction to spectral clustering, see von Luxburg [48].

While there has been some theoretical work on properties of spectral 
clustering on finite point sets (e.g., Spielman and Teng [44], Guattery and
Miller [20], Kannan, Vempala and Vetta [28]), so far there have not been 
any results discussing the limit behavior of spectral clustering for samples 
drawn from some underlying probability distribution. In the current article, 
we establish consistency results and convergence rates for several versions of 
spectral clustering. To prove those results, the main step is to establish the 
convergence of eigenvalues and eigenvectors of random graph Laplace matri- 
ces for growing sample size. Interestingly, our analysis shows that while one 
type of spectral clustering ("normalized") is consistent under very general 
conditions, another popular version of spectral clustering ("unnormalized")
is only consistent under some very specific conditions which do not have to 
be satisfied in practice. We therefore conclude that the normalized clustering 
algorithm should be the preferred method in practical applications. 

From a mathematical point of view, the question of convergence of spec- 
tral clustering boils down to the question of convergence of spectral prop- 
erties of random graph Laplacian matrices constructed from sample points. 
The convergence of eigenvalues and eigenvectors of certain random matrices 
has already been studied extensively in the statistics community, especially 
for random matrices of fixed size such as sample covariance matrices, or for 
random matrices with i.i.d. entries (see Bai [6] for a review). However, those 
results cannot be applied in our setting, as the size of the graph Laplacian 
matrices ($n \times n$) increases with the sample size n, and the entries of the random
graph Laplacians are not independent from each other. In the machine
learning community, the spectral convergence of positive definite "kernel 
matrices" has attracted some attention, as can be seen in Shawe-Taylor et
al. [42], Bengio et al. [12] and Williams and Seeger [50]. Here, several authors
build on the work of Baker [7], who studies numerical solutions of
integral equations by deterministic discretizations of integral operators on 
a grid. However, his methods cannot be carried over to our case, where in- 
tegral operators are discretized by a random selection of sample points (see 
Section 11.10 of von Luxburg [47] for details). Finally, Koltchinskii [30] and 
Koltchinskii and Giné [31] have obtained convergence results for random
discretizations of integral operators which are close to what we would need 
for spectral clustering. However, to apply their techniques and results, it 
is necessary that the operators under consideration are Hilbert-Schmidt,
which turns out not to be the case for the unnormalized Laplacian. Consequently,
to prove consistency results for spectral clustering, we have to
derive new methods which hold under more general conditions than all the
methods mentioned above. As a by-product we recover certain results from
Koltchinskii [30] and Koltchinskii and Giné [31] by using considerably simpler
techniques.

There has been some debate on the question whether normalized or un- 
normalized spectral clustering should be used. Recent papers using the nor- 
malized version include Van Driessche and Roose [46], Shi and Malik [43], 
Kannan, Vempala and Vetta [28], Ng, Jordan and Weiss [36] and Meila and 
Shi [33], while Barnard, Pothen and Simon [8] and Guattery and Miller [20]
use unnormalized clustering. Comparing the empirical behavior of both approaches,
Van Driessche and Roose [46] and Weiss [49] find some evidence
that the normalized version should be preferred. On the other hand, under 
certain conditions, Higham and Kibble [25] advocate for the unnormalized 
version. It seems difficult to resolve this question from purely graph-theoretic 
considerations, as both normalized and unnormalized spectral clustering can 
be justified by similar graph theoretic principles (see next section). In our 
work we now obtain the first theoretical results on this question. They show 
the superiority of normalized spectral clustering over unnormalized spectral 
clustering from a statistical point of view. 

This paper is organized as follows: In Section 2 we briefly introduce the 
family of spectral clustering algorithms, and describe what the difference be- 
tween "normalized" and "unnormalized" spectral clustering is. After giving 
an informal overview of our consistency results in Section 3, we introduce 
mathematical prerequisites and notation in Section 4. The convergence of 
normalized spectral clustering is stated and proved in Section 5, and rates 
of convergence are proved in Section 6. In Section 7 we establish conditions 
for the convergence of unnormalized spectral clustering. Those conditions 
are studied in detail in Section 8. In particular, we investigate the spectral 
properties of the limit operators corresponding to normalized and unnor- 
malized spectral clustering, point out some important differences, and show 
theoretical and practical examples where the convergence conditions in the 
unnormalized case are violated. 

2. Spectral clustering. The purpose of this section is to briefly introduce 
the class of spectral clustering algorithms. For a comprehensive introduction 
to spectral clustering and its various derivations, explanations and proper- 
ties, we refer to von Luxburg [48]. Readers who are familiar with spectral 
clustering or who first want to get an overview of our results are encouraged
to jump to Section 3 immediately.

Assume we are given a set of data points $X_1, \ldots, X_n$ and pairwise similarities
$k_{ij} := k(X_i, X_j)$ which are symmetric (i.e., $k_{ij} = k_{ji}$) and nonnegative.
We denote the data similarity matrix as $K := (k_{ij})_{i,j=1,\ldots,n}$ and define the
matrix $D$ to be the diagonal matrix with entries $d_i := \sum_{j=1}^{n} k_{ij}$. Spectral
clustering uses matrices which have been studied extensively in spectral
graph theory, so-called graph Laplacians. Graph Laplacians exist in three 
different flavors. The unnormalized graph Laplacian (sometimes also called 
the combinatorial Laplacian) is defined as the matrix 

$$L = D - K.$$

The normalized graph Laplacians are defined as

$$L' = D^{-1/2} L D^{-1/2} = I - D^{-1/2} K D^{-1/2},$$

$$L'' = D^{-1} L = I - D^{-1} K.$$

Given a vector $f = (f_1, \ldots, f_n)^t \in \mathbb{R}^n$, the following key identity can be easily
verified:

$$f^t L f = \frac{1}{2} \sum_{i,j=1}^{n} k_{ij} (f_i - f_j)^2.$$

This equation shows that L is positive semi-definite. It can easily be seen 
that the smallest eigenvalue of L is 0, and the corresponding eigenvector 
is the constant one vector $\mathbf{1} = (1, \ldots, 1)^t$. Similar properties can be shown
for $L'$ and $L''$. There is a tight relationship between the spectra of the two
normalized graph Laplacians: $v$ is an eigenvector of $L''$ with eigenvalue $\lambda$ if
and only if $w = D^{1/2} v$ is an eigenvector of $L'$ with eigenvalue $\lambda$. So from a
spectral point of view, the two normalized graph Laplacians are equivalent. 
A discussion of various other properties of graph Laplacians can be found in 
the literature; see, for example, Chung [14] for the normalized and Mohar 
[35] for the unnormalized case. 
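
The properties listed above can be checked numerically on a small example. The following is only a sketch under assumptions of our own choosing: NumPy, six random points in the plane, and Gaussian similarities with unit bandwidth. The helper names `graph_laplacians` and `pairwise_quadratic_form` are hypothetical and not part of the paper.

```python
import numpy as np


def graph_laplacians(K):
    """Return (L, L_sym, L_rw) for a symmetric similarity matrix K with positive degrees."""
    K = np.asarray(K, dtype=float)
    d = K.sum(axis=1)                       # degrees d_i = sum_j k_ij
    inv_sqrt_d = 1.0 / np.sqrt(d)
    L = np.diag(d) - K                      # unnormalized: L = D - K
    L_sym = inv_sqrt_d[:, None] * L * inv_sqrt_d[None, :]   # L' = D^{-1/2} L D^{-1/2}
    L_rw = L / d[:, None]                   # L'' = D^{-1} L (row i divided by d_i)
    return L, L_sym, L_rw


def pairwise_quadratic_form(K, f):
    """Evaluate 1/2 * sum_{i,j} k_ij (f_i - f_j)^2."""
    diff = f[:, None] - f[None, :]
    return 0.5 * np.sum(K * diff ** 2)


# Illustrative data (an assumption): six random points, Gaussian similarities.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)
L, L_sym, L_rw = graph_laplacians(K)

# Key identity: f^t L f equals the weighted sum of squared differences.
f = rng.normal(size=6)
identity_gap = abs(f @ L @ f - pairwise_quadratic_form(K, f))

# An eigenvector w of L' yields an eigenvector v = D^{-1/2} w of L''
# with the same eigenvalue.
eigvals_sym, eigvecs_sym = np.linalg.eigh(L_sym)
w = eigvecs_sym[:, 1]
v = w / np.sqrt(K.sum(axis=1))
eigen_residual = np.linalg.norm(L_rw @ v - eigvals_sym[1] * v)

print(identity_gap, eigen_residual)
```

Both printed quantities should vanish up to floating point error, and `L @ np.ones(6)` should be numerically zero, in line with the constant vector being an eigenvector for the eigenvalue 0.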

There exist two major versions of spectral clustering, which we call "nor- 
malized" or "unnormalized" spectral clustering, respectively. The basic ver- 
sions of those algorithms can be summarized as follows: 



Basic spectral bi-clustering algorithms

Input: Similarity matrix $K \in \mathbb{R}^{n \times n}$.

Find the eigenvector $v$ corresponding to the second
smallest eigenvalue for one of the following
problems:

$Lv = \lambda v$ (for unnormalized spectral clustering),

$L'v = \lambda v$ (for normalized spectral clustering).

Output: Clusters $A = \{j;\ v_j \geq 0\}$ and $\bar{A} = \{j;\ v_j < 0\}$.
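
The basic algorithm above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a dense symmetric similarity matrix with positive degrees; the function name `spectral_bicluster` and the toy data (two groups of ten points on the real line with a unit-bandwidth Gaussian similarity) are our own assumptions, not the authors' implementation.

```python
import numpy as np


def spectral_bicluster(K, normalized=True):
    """Split the index set by the sign of the second eigenvector of L or L'."""
    K = np.asarray(K, dtype=float)
    d = K.sum(axis=1)
    L = np.diag(d) - K                          # unnormalized Laplacian L = D - K
    if normalized:
        s = 1.0 / np.sqrt(d)
        L = s[:, None] * L * s[None, :]         # L' = D^{-1/2} L D^{-1/2}
    _, eigvecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    v = eigvecs[:, 1]                           # second smallest eigenvalue
    return v >= 0                               # True marks one cluster, False the other


# Toy data (an assumption): two well separated groups on the real line.
x = np.concatenate([np.linspace(-0.5, 0.5, 10), np.linspace(2.5, 3.5, 10)])
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

labels_normalized = spectral_bicluster(K, normalized=True)
labels_unnormalized = spectral_bicluster(K, normalized=False)
print(labels_normalized.astype(int))
print(labels_unnormalized.astype(int))
```

Since an eigenvector is only determined up to sign, which of the two groups receives the label `True` is arbitrary; only the induced partition is meaningful.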



It is not straightforward to see why the clusters produced by those algorithms
are useful in any way. The roots of spectral clustering lie in spectral
graph theory. Here we consider the "similarity graph" induced by the data,
namely, the graph with adjacency matrix K. On this graph, clustering reduces
to the problem of graph partitioning: we want to find a partition of the
graph such that the edges between different groups have very low weights
(which means that points in different clusters are dissimilar from each other)
and the edges within a group have high weights (which means that points
within the same cluster are similar to each other). Different ways of formulating
and solving the objective functions of such graph partitioning problems
lead to normalized and unnormalized spectral clustering, respectively. For 
details, we refer to von Luxburg [48]. 

Note that the spectral clustering algorithms as presented above are sim- 
plified versions of spectral clustering. The implementations used in practice 
can differ in various details. In particular, in the case when one is interested 
in obtaining more than two clusters, one typically uses not only the second 
but also the next few eigenvectors to construct a partition. Moreover, more 
complicated rules can be used to construct a partition from the coordinates 
of the eigenvectors than simply thresholding the eigenvector at 0. For details, 
see von Luxburg [48]. 

3. Informal statement of our results. In this section we want to present 
our main results in a slightly informal but intuitive manner. For the pre- 
cise mathematical details and proofs, we refer to the following sections. The 
goal of this article is to study the behavior of normalized and unnormalized 
spectral clustering on random samples when the sample size n is growing. 
In Section 2 we have seen that spectral clustering partitions a given sample
$X_1, \ldots, X_n$ according to the coordinates of the first eigenvectors of the
(normalized or unnormalized) Laplace matrix. To stress that the Laplace
matrices depend on the sample size n, from now on we denote the unnormalized
and normalized graph Laplacians by $L_n$, $L'_n$ and $L''_n$ (instead of
$L$, $L'$ and $L''$ as in the last section). To investigate whether the various
spectral clustering algorithms converge, we will have to establish conditions
under which the eigenvectors of the Laplace matrices "converge." To see
which kind of convergence results we aim at, consider the case of the second
eigenvector $(v_1, \ldots, v_n)^t$ of $L_n$. It can be interpreted as a function $f_n$ on the
discrete space $\mathcal{X}_n := \{X_1, \ldots, X_n\}$ by defining $f_n(X_i) := v_i$, and clustering
is then performed according to whether $f_n$ is smaller or larger than a certain
threshold. It is clear that in the limit for $n \to \infty$, we would like this
discrete function $f_n$ to converge to a function $f$ on the whole data space
$\mathcal{X}$ such that we can use the values of this function to partition the data
space. In our case it will turn out that this space can be chosen as $C(\mathcal{X})$,
the space of continuous functions on $\mathcal{X}$. In particular, we will construct a
degree function $d \in C(\mathcal{X})$ which will be the "limit" of the discrete degree
vector $(d_1, \ldots, d_n)$. Moreover, we will explicitly construct linear operators
$U$, $U'$ and $U''$ on $C(\mathcal{X})$ which will be the limit of the discrete operators $L_n$,
$L'_n$ and $L''_n$. Certain eigenvectors of the discrete operators are then proved
to "converge" (in a certain sense to be explained later) to eigenfunctions of
those limit operators. Those eigenfunctions will then be used to construct a
partition of the whole data space $\mathcal{X}$.

In the case of normalized spectral clustering it will turn out that this 
limit process behaves very nicely. We can prove that, under certain mild 
conditions, the partitions constructed on finite samples converge to a sensible 
partition of the whole data space. In meta-language, this result can be stated 
as follows: 

Result 1 (Convergence of normalized spectral clustering). Under mild
assumptions, if the first r eigenvalues $\lambda_1, \ldots, \lambda_r$ of the limit operator $U'$
satisfy $\lambda_i \neq 1$ and have multiplicity 1, then the same holds for the first r
eigenvalues of $L'_n$ for sufficiently large n. In this case, the first r eigenvalues
of $L'_n$ converge to the first r eigenvalues of $U'$ a.s., and the corresponding
eigenvectors converge a.s. The clusterings constructed by normalized spectral
clustering from the first r eigenvectors on finite samples converge almost
surely to a limit clustering of the whole data space.
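
As a purely empirical illustration of the kind of statement made in Result 1 (not a verification of it), one can track the first eigenvalues of the normalized Laplacian along nested samples of growing size. The sketch below rests on assumptions of our own: NumPy, a uniform distribution on [0, 1], a Gaussian similarity with bandwidth 0.5 (continuous and bounded away from 0 on this compact space), and hypothetical helper names.

```python
import numpy as np


def gaussian_similarity(x, sigma=0.5):
    """Similarity matrix k(x_i, x_j) = exp(-|x_i - x_j|^2 / sigma^2) for 1-d points."""
    diff = x[:, None] - x[None, :]
    return np.exp(-diff ** 2 / sigma ** 2)


def normalized_laplacian_eigs(x, r=3, sigma=0.5):
    """First r eigenvalues (ascending) of L' = I - D^{-1/2} K D^{-1/2}."""
    K = gaussian_similarity(x, sigma)
    s = 1.0 / np.sqrt(K.sum(axis=1))
    L_sym = np.eye(len(x)) - s[:, None] * K * s[None, :]
    return np.linalg.eigvalsh(L_sym)[:r]


# One i.i.d. sequence; the sample of size n consists of its first n points.
rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=800)

second_eigs = {n: normalized_laplacian_eigs(sample[:n])[1] for n in (100, 200, 400, 800)}
for n, value in second_eigs.items():
    print(n, value)
```

In such experiments the second eigenvalue is expected to stabilize as n grows; a single run with one seed is of course only suggestive.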

In the unnormalized case, the convergence theorem looks quite similar, 
but there are some subtle differences that will turn out to be important. 

Result 2 (Convergence of unnormalized spectral clustering). Under mild
assumptions, if the first r eigenvalues of the limit operator $U$ have multiplicity
1 and do not lie in the range of the degree function $d$, then the same holds
for the first r eigenvalues of $\frac{1}{n} L_n$ for sufficiently large n. In this case, the
first r eigenvalues of $\frac{1}{n} L_n$ converge to the first r eigenvalues of $U$ a.s., and
the corresponding eigenvectors converge a.s. The clusterings constructed
by unnormalized spectral clustering from the first r eigenvectors on finite
samples converge almost surely to a limit clustering of the whole data space.

At first glance, both results look very similar: if the first eigenvalues are
"nice," then spectral clustering converges. However, the difference between
Results 1 and 2 is what it means for an eigenvalue to be "nice." For the convergence
statements to hold, in Result 1 we only need the condition $\lambda_i \neq 1$,
while in Result 2 the condition $\lambda_i \notin \mathrm{rg}(d)$ has to be satisfied. Both conditions
are needed to ensure that the eigenvalue $\lambda_i$ is isolated in the spectrum
of the limit operator, which is a fundamental requirement for applying perturbation
theory to the convergence of eigenvectors. We will see that in the
normalized case, the limit operator $U'$ has the form $\mathrm{Id} - T$ where $T$ is a compact
linear operator. As a consequence, the spectrum of $U'$ is very benign,
and all eigenvalues $\lambda \neq 1$ are isolated and have finite multiplicity. In the unnormalized
case, however, the limit operator will have the form $U = M - S$,
where $M$ is a multiplication operator and $S$ a compact integral operator.
The spectrum of $U$ is not as nice as the one of $U'$, and, in particular, it
contains the continuous interval $\mathrm{rg}(d)$. Eigenvalues of this operator will only
be isolated in the spectrum if they satisfy the condition $\lambda \notin \mathrm{rg}(d)$. As the
following result shows, this condition has important consequences.

Result 3 [The condition $\lambda \notin \mathrm{rg}(d)$ is necessary].

1. There exist examples of similarity functions such that there exists no
nonzero eigenvalue outside of $\mathrm{rg}(d)$.

2. If this is the case, the sequence of second eigenvalues of $\frac{1}{n} L_n$ computed
by any numerical eigensolver converges to $\min d(x)$. The corresponding
eigenvectors do not yield a sensible clustering of the data space.

3. For a large class of reasonable similarity functions, there are only finitely
many eigenvalues (say, $r_0$) inside the interval $]0, \min d(x)[$. In this case,
the same problems as above occur if the number r of eigenvalues used for
clustering satisfies $r > r_0$.

4. The condition $\lambda \notin \mathrm{rg}(d)$ refers to the limit case and, hence, cannot be
verified on the finite sample.
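
To make the quantities in Result 3 concrete, the sketch below computes the spectrum of the scaled unnormalized Laplacian together with the smallest empirical degree on one sample. As item 4 stresses, the condition itself concerns the limit operator, so this comparison is only a heuristic look at finite-sample quantities and not a test of the condition. NumPy, the uniform sample, the Gaussian bandwidth and the function name are all assumptions made for illustration.

```python
import numpy as np


def unnormalized_spectrum_vs_degree(x, sigma=0.5):
    """Eigenvalues of (1/n) L_n, the smallest scaled degree, and how many eigenvalues lie below it."""
    n = len(x)
    diff = x[:, None] - x[None, :]
    K = np.exp(-diff ** 2 / sigma ** 2)          # Gaussian similarity matrix
    d = K.sum(axis=1)                            # degrees d_i = sum_j k(X_i, X_j)
    L_scaled = (np.diag(d) - K) / n              # (1/n) L_n
    eigvals = np.linalg.eigvalsh(L_scaled)       # ascending order
    min_degree = d.min() / n                     # empirical stand-in for min d(x)
    n_below = int(np.sum(eigvals < min_degree))  # eigenvalues below the empirical minimum degree
    return eigvals, min_degree, n_below


rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=300)
eigvals, min_degree, n_below = unnormalized_spectrum_vs_degree(x)
print(min_degree, n_below, eigvals[:5])
```

Reading `n_below` as a rough finite-sample analogue of the number of eigenvalues below $\min d(x)$ is our own interpretation and should be treated with corresponding caution.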

This result complements Result 2. The main message is that there are 
many examples where the conditions of Result 2 are not satisfied, that in this 
case unnormalized spectral clustering fails completely, and that we cannot 
detect on a finite sample whether the convergence conditions are satisfied or 
not. 

To further investigate the statistical properties of normalized spectral 
clustering, we compute rates of convergence. Informally, our result is the 
following: 

Result 4 (Rates of convergence). The rates of convergence of normalized
spectral clustering can be expressed in terms of regularity conditions
of the similarity function k. For example, for the case of the widely used
Gaussian similarity function $k(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$ on $\mathbb{R}^d$, we obtain
a rate of $O(1/\sqrt{n})$.

Finally, we show how our theoretical results influence the results of spec- 
tral clustering in practice. In particular, we demonstrate differences between 
the behavior of normalized and unnormalized spectral clustering. 

4. Prerequisites and notation. In the rest of the paper we always make 
the following general assumptions:

General assumptions. The data space $\mathcal{X}$ is a compact metric space,
$\mathcal{B}$ the Borel $\sigma$-algebra on $\mathcal{X}$, and $P$ a probability measure on $(\mathcal{X}, \mathcal{B})$. Without
loss of generality we assume that the support of $P$ coincides with $\mathcal{X}$.
The sample points $(X_i)_{i \in \mathbb{N}}$ are drawn independently according to $P$. The
similarity function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is supposed to be symmetric, continuous
and bounded away from 0 by a positive constant, that is, there exists a
constant $l > 0$ such that $k(x, y) > l$ for all $x, y \in \mathcal{X}$.

Most of those assumptions are standard in the spectral clustering litera- 
ture. We need the symmetry of the similarity function in order to be able to 
represent our data by an undirected graph (note that spectral graph theory 
does not carry over to directed graphs as, e.g., the graph Laplacians are 
no longer symmetric). The continuity of k is needed for robustness reasons: 
small changes in the data should not change the result too much. For the 
same reason, we make the assumption that k should be bounded away from 
0. This becomes necessary when we consider normalized graph Laplacians, 
where we divide by the degree function and still want the result to be robust 
with respect to small changes in the underlying data. Only the compactness 
of $\mathcal{X}$ is added for mathematical convenience. Most results in this article
are also true without compactness, but their proofs would require a serious 
technical overhead which does not add to the general understanding of the 
problem. 

For a finite sample $X_1, \ldots, X_n$, which has been drawn i.i.d. according to $P$,
and a given similarity function k as in the General assumptions, we denote
the similarity matrix by $K_n = (k(X_i, X_j))_{i,j \leq n}$ and the degree matrix $D_n$ as
the diagonal matrix with the degrees $d_i = \sum_{j=1}^{n} k(X_i, X_j)$ on the diagonal.
Similarly, we will denote the unnormalized and normalized Laplace matrices
by $L_n = D_n - K_n$ and $L'_n = D_n^{-1/2} L_n D_n^{-1/2}$. The eigenvalues of the Laplace
matrices $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ will always be ordered in increasing order,
respecting multiplicities. The term "first eigenvalue" always refers to the
trivial eigenvalue $\lambda_1 = 0$. Note that throughout the whole paper, we will use
superscript-t (such as $f^t$) to denote the transpose of a vector or a matrix,
while "primes" (as in $L'$ or $L''$) are used to distinguish different matrices
and operators. $I$ is used to denote the identity matrix.

For a real-valued function $f$, we always denote the range of the function by
$\mathrm{rg}(f)$. If $\mathcal{X}$ is connected and $f$ is continuous, $\mathrm{rg}(f) = [\inf_x f(x), \sup_x f(x)]$.
The restriction operator $\rho_n : C(\mathcal{X}) \to \mathbb{R}^n$ denotes the (random) operator
which maps a function to its values on the first n data points, that is,

$$\rho_n f = (f(X_1), \ldots, f(X_n))^t.$$

Now we want to recall certain facts from spectral and perturbation theory.
For more details, we refer to Chatelin [13], Anselone [3] and Kato [29]. By
$\sigma(T) \subset \mathbb{C}$, we denote the spectrum of a bounded linear operator $T$ on some
Banach space $E$. We define the discrete spectrum $\sigma_d(T)$ to be the part of $\sigma(T)$
which consists of all isolated eigenvalues with finite algebraic multiplicity,
and the essential spectrum $\sigma_{ess}(T) = \sigma(T) \setminus \sigma_d(T)$. The essential spectrum
is always closed, and the discrete spectrum can only have accumulation
points on the boundary to the essential spectrum. It is well known (e.g.,
Theorem IV.5.35 in Kato [29]) that compact perturbations do not affect the
essential spectrum, that is, for a bounded operator $T$ and a compact operator
$V$, we have $\sigma_{ess}(T + V) = \sigma_{ess}(T)$. A subset $\tau \subset \sigma(T)$ is called isolated if there
exists an open neighborhood $M \subset \mathbb{C}$ of $\tau$ such that $\sigma(T) \cap M = \tau$. For an
isolated part $\tau \subset \sigma(T)$, the corresponding spectral projection $\mathrm{Pr}_\tau$ is defined
as $\frac{1}{2\pi i} \int_\Gamma (T - \lambda I)^{-1} \, d\lambda$, where $\Gamma$ is a closed Jordan curve in the complex plane
separating $\tau$ from the rest of the spectrum. In the special case where $\tau = \{\lambda\}$
for an isolated eigenvalue $\lambda$, $\mathrm{Pr}_\tau$ is a projection on the invariant subspace
related to $\lambda$. If $\lambda$ is a simple eigenvalue (i.e., it has algebraic multiplicity
1), then the spectral projection $\mathrm{Pr}_\tau$ is a projection on the eigenfunction
corresponding to $\lambda$.

Definition 5 (Convergence of operators). Let $(E, \|\cdot\|_E)$ be an arbitrary
Banach space, $B$ its unit ball, and $(S_n)_n$ a sequence of bounded linear
operators on $E$:

• $(S_n)_n$ converges pointwise, denoted by $S_n \overset{p}{\to} S$, if $\|S_n x - S x\|_E \to 0$ for all
$x \in E$.

• $(S_n)_n$ converges compactly, denoted by $S_n \overset{c}{\to} S$, if it converges pointwise
and if for every sequence $(x_n)_n$ in $B$, the sequence $(S - S_n) x_n$ is relatively
compact (has compact closure) in $(E, \|\cdot\|_E)$.

• $(S_n)_n$ converges in operator norm, denoted by $S_n \overset{\|\cdot\|}{\to} S$, if $\|S_n - S\| \to 0$,
where $\|\cdot\|$ denotes the operator norm.

• $(S_n)_n$ is called collectively compact if the set $\bigcup_n S_n B$ is relatively compact
in $(E, \|\cdot\|_E)$.

• $(S_n)_n$ converges collectively compactly, denoted by $S_n \overset{cc}{\to} S$, if it converges
pointwise and if there exists some $N \in \mathbb{N}$ such that the operators
$(S_n - S)_{n > N}$ are collectively compact.

Both operator norm convergence and collectively compact convergence 
imply compact convergence. The latter is enough to ensure the convergence 
of spectral properties in the following sense (cf. Proposition 3.18 and Sec- 
tions 3.6 and 5.1 in Chatelin [13]): 

Proposition 6 (Perturbation results for compact convergence). Let
$(E, \|\cdot\|_E)$ be an arbitrary Banach space and $(T_n)_n$ and $T$ bounded linear
operators on $E$ with $T_n \overset{c}{\to} T$. Let $\lambda \in \sigma(T)$ be an isolated eigenvalue with
finite multiplicity $m$, and $M \subset \mathbb{C}$ an open neighborhood of $\lambda$ such that
$\sigma(T) \cap M = \{\lambda\}$. Then:

1. Convergence of eigenvalues: There exists an $N \in \mathbb{N}$ such that, for all
$n > N$, the set $\sigma(T_n) \cap M$ is an isolated part of $\sigma(T_n)$, consists of at most
$m$ different eigenvalues, and their multiplicities sum up to $m$. Moreover,
the sequence of sets $\sigma(T_n) \cap M$ converges to the set $\{\lambda\}$ in the sense that
every sequence $(\lambda_n)_{n \in \mathbb{N}}$ with $\lambda_n \in \sigma(T_n) \cap M$ satisfies $\lim \lambda_n = \lambda$.

2. Convergence of spectral projections: Let $\mathrm{Pr}$ be the spectral projection of $T$
corresponding to $\lambda$, and for $n > N$, let $\mathrm{Pr}_n$ be the spectral projection of $T_n$
corresponding to $\sigma(T_n) \cap M$ (which is well defined according to part 1).
Then $\mathrm{Pr}_n \overset{p}{\to} \mathrm{Pr}$.

3. Convergence of eigenvectors: If, additionally, $\lambda$ is a simple eigenvalue,
then there exists some $N \in \mathbb{N}$ such that, for all $n > N$, the sets $\sigma(T_n) \cap M$
consist of a simple eigenvalue $\lambda_n$. The corresponding eigenfunctions $f_n$
converge up to a change of sign [i.e., there exists a sequence $(a_n)_n$ of
signs $a_n \in \{-1, +1\}$ such that $a_n f_n$ converges].

Proof. See Proposition 3.18 and Sections 3.6 and 5.1 in Chatelin [13]. □

To prove rates of convergence, we will also need some quantitative per- 
turbation theory results for spectral projections. The following theorem can 
be found in Atkinson [5]: 

Theorem 7 (Atkinson [5]). Let $(E, \|\cdot\|_E)$ be an arbitrary Banach space
and $B$ its unit ball. Let $(K_n)_{n \in \mathbb{N}}$ and $K$ be compact linear operators on $E$
such that $K_n \overset{cc}{\to} K$. For a nonzero eigenvalue $\lambda \in \sigma(K)$, denote the corresponding
spectral projection by $\mathrm{Pr}$. Let $M \subset \mathbb{C}$ be an open neighborhood of
$\lambda$ such that $\sigma(K) \cap M = \{\lambda\}$. There exists some $N \in \mathbb{N}$ such that, for all
$n > N$, the set $\sigma(K_n) \cap M$ is isolated in $\sigma(K_n)$. Let $\mathrm{Pr}_n$ be the corresponding
spectral projections of $K_n$ for $\sigma(K_n) \cap M$. Then there exists a constant $C > 0$
such that, for every $x \in \mathrm{Pr}\, E$,

$$\|x - \mathrm{Pr}_n x\|_E \leq C \left( \|(K_n - K) x\|_E + \|x\|_E \, \|(K - K_n) K_n\| \right).$$

The constant $C$ is independent of $x$, but it depends on $\lambda$ and $\sigma(K)$.

For a probability measure $P$ and a function $f \in C(\mathcal{X})$, we introduce the
abbreviation $Pf := \int f(x) \, dP(x)$. Let $(X_n)_n$ be a sequence of i.i.d. random
variables drawn according to $P$, and denote by $P_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$ the corresponding
empirical distributions. A set $\mathcal{F} \subset C(\mathcal{X})$ is called a Glivenko-Cantelli
class if

$$\sup_{f \in \mathcal{F}} |Pf - P_n f| \to 0 \quad \text{a.s.}$$

Finally, the covering numbers $N(\mathcal{F}, \varepsilon, d)$ of a totally bounded set $\mathcal{F}$ with
metric $d$ are defined as the smallest number n such that $\mathcal{F}$ can be covered
with n balls of radius $\varepsilon$.
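
The notation $Pf$, $P_n f$ and the supremum in the Glivenko-Cantelli property can be illustrated numerically. The sketch below assumes, purely for illustration, that $P$ is the uniform distribution on [0, 1] and that the class consists of finitely many functions $x \mapsto \cos(tx)$, for which $Pf = \sin(t)/t$ is known in closed form. A finite class is a trivial case (the strong law of large numbers suffices), so this only demonstrates the objects involved. The helper name `sup_deviation` is hypothetical.

```python
import numpy as np


def sup_deviation(sample, ts):
    """Supremum over the finite class {x -> cos(t x)} of |P f - P_n f|, with P uniform on [0, 1]."""
    deviations = []
    for t in ts:
        exact = np.sin(t) / t                      # P f = integral of cos(t x) over [0, 1]
        empirical = np.mean(np.cos(t * sample))    # P_n f = average of f over the sample
        deviations.append(abs(exact - empirical))
    return max(deviations)


rng = np.random.default_rng(0)
ts = np.linspace(0.5, 10.0, 20)                    # parameters indexing the function class
draws = rng.uniform(0.0, 1.0, size=20000)

sup_dev_small = sup_deviation(draws[:100], ts)     # P_n built from the first 100 points
sup_dev_large = sup_deviation(draws, ts)           # P_n built from all 20000 points
print(sup_dev_small, sup_dev_large)
```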
5. Convergence of normalized spectral clustering. In this section we
present our results on the convergence of normalized spectral clustering. We
start with an overview of our method, then prove several propositions,
and finally state and prove our main theorems at the end of this section.
The case of unnormalized spectral clustering will be treated in Section 7.

5.1. Overview of the methods. On a high level, the approach to prove
convergence of spectral clustering is very similar in both the normalized and
unnormalized case. In this section we focus on the normalized case. Moreover,
as we have already seen that there is an explicit one-to-one relationship
between the eigenvalues and eigenvectors of $L'_n$, $L''_n$ and the generalized
eigenproblem $L_n v = \lambda D_n v$, we only consider the matrix $L'_n$ in the following.
All results naturally can be carried over to the other cases. To study the
convergence of spectral clustering, we have to investigate whether the eigenvectors
of the Laplacians constructed on n sample points "converge" for
$n \to \infty$. For simplicity, let us discuss the case of the second eigenvector. For
all $n \in \mathbb{N}$, let $v_n \in \mathbb{R}^n$ be the second eigenvector of $L'_n$. The technical difficulty
for proving convergence of $(v_n)_{n \in \mathbb{N}}$ is that, for different sample sizes n,
the vectors $v_n$ live in different spaces (as they have length n). Thus, standard
notions of convergence cannot be applied. What we want to show instead is
that there exists a function $f \in C(\mathcal{X})$ such that the difference between the
eigenvector $v_n$ and the restriction of $f$ to the sample converges to 0, that
is, $\|v_n - \rho_n f\|_\infty \to 0$. Our approach to achieve this takes one more detour.
We replace the vector $v_n$ by a function $f_n \in C(\mathcal{X})$ such that $v_n = \rho_n f_n$. This
function $f_n$ will be the second eigenfunction of an operator $U'_n$ acting on the
space $C(\mathcal{X})$. Then we use the fact that

$$\|v_n - \rho_n f\|_\infty = \|\rho_n f_n - \rho_n f\|_\infty \leq \|f_n - f\|_\infty.$$

Hence, it will be enough to show that $\|f_n - f\|_\infty \to 0$. As the sequence $f_n$
will be random, this convergence will hold almost surely.

Step 1 [Relating the matrices $L'_n$ to linear operators $U'_n$ on
$C(\mathcal{X})$]. First we will construct a family $(U'_n)_{n \in \mathbb{N}}$ of
linear operators on $C(\mathcal{X})$ which, if restricted to the sample, "behaves"
as $(L'_n)_{n \in \mathbb{N}}$: for all $f \in C(\mathcal{X})$, we will have the
relation $\rho_n U'_n f = L'_n \rho_n f$. In the following we will then study the
convergence of $(U'_n)_n$ on $C(\mathcal{X})$ instead of the convergence of
$(L'_n)_n$.

Step 2 [Relation between $\sigma(L'_n)$ and $\sigma(U'_n)$]. In Step 1 we
replaced the operators $L'_n$ by operators $U'_n$ on $C(\mathcal{X})$. But as we are
interested in the eigenvectors of $L'_n$, we have to check whether they can
actually be recovered from the eigenfunctions of $U'_n$. By elementary linear
algebra, we can prove that the "interesting" eigenfunctions $f_n$ and
eigenvectors $v_n$ of $U'_n$ and $L'_n$ are in a one-to-one relationship and can
be computed from each other by the relation $v_n = \rho_n f_n$. As a consequence,
if the eigenfunctions $f_n$ of $U'_n$ converge, the same is true for the
eigenvectors of $L'_n$.

Step 3 (Convergence of $U'_n \to U'$). In this step we want to prove that
certain eigenvalues and eigenfunctions of $U'_n$ converge to the corresponding
quantities of some limit operator $U'$. For this, we will have to establish a
rather strong type of convergence of linear operators. Pointwise convergence
is in general too weak for this purpose; on the other hand, it will turn out
that operator norm convergence does not hold in our context. The type
of convergence we will consider is compact convergence, which is between
pointwise convergence and operator norm convergence and is just strong
enough for proving convergence of spectral properties. The notion of compact
convergence was originally developed in the context of (deterministic)
numerical approximation of integral operators. We adapt those methods to
a framework where the spectrum of a linear operator $U'$ is approximated
by the spectra of random operators $U'_n$. Here, a key element is the fact that
certain classes of functions are Glivenko-Cantelli classes: the integrals over
the functions in those classes can be approximated uniformly by empirical
integrals based on the random sample.

5.2. Step 1: Construction of the operators on $C(\mathcal{X})$. We define the
following functions and operators, which are all supposed to act on
$C(\mathcal{X})$: The degree functions

\[ d_n(x) := \int k(x,y)\, dP_n(y) \in C(\mathcal{X}), \]
\[ d(x) := \int k(x,y)\, dP(y) \in C(\mathcal{X}), \]

the multiplication operators,

\[ M_{d_n} : C(\mathcal{X}) \to C(\mathcal{X}), \qquad M_{d_n} f(x) := d_n(x) f(x), \]
\[ M_d : C(\mathcal{X}) \to C(\mathcal{X}), \qquad M_d f(x) := d(x) f(x), \]

the integral operators

\[ S_n : C(\mathcal{X}) \to C(\mathcal{X}), \qquad S_n f(x) := \int k(x,y) f(y)\, dP_n(y), \]
\[ S : C(\mathcal{X}) \to C(\mathcal{X}), \qquad S f(x) := \int k(x,y) f(y)\, dP(y), \]

and the corresponding differences

\[ U_n : C(\mathcal{X}) \to C(\mathcal{X}), \qquad U_n f(x) := M_{d_n} f(x) - S_n f(x), \]
\[ U : C(\mathcal{X}) \to C(\mathcal{X}), \qquad U f(x) := M_d f(x) - S f(x). \]

The operators $U_n$ and $U$ will be used to deal with the case of unnormalized
spectral clustering. For the normalized case, we introduce the normalized
similarity functions

\[ h_n(x,y) := k(x,y) / \sqrt{d_n(x) d_n(y)}, \]
\[ h(x,y) := k(x,y) / \sqrt{d(x) d(y)}, \]

the integral operators

\[ T_n : C(\mathcal{X}) \to C(\mathcal{X}), \qquad T_n f(x) = \int h(x,y) f(y)\, dP_n(y), \]
\[ T'_n : C(\mathcal{X}) \to C(\mathcal{X}), \qquad T'_n f(x) = \int h_n(x,y) f(y)\, dP_n(y), \]
\[ T : C(\mathcal{X}) \to C(\mathcal{X}), \qquad T f(x) = \int h(x,y) f(y)\, dP(y), \]

and the differences

\[ U'_n := I - T'_n, \]
\[ U' := I - T. \]

In all that follows, the operators introduced above are always meant to
act on the Banach space $(C(\mathcal{X}), \|\cdot\|_\infty)$, and their operator norms
will be taken with respect to this space. We now summarize the properties of
those operators in the following proposition. Recall the general assumptions
and the definition of the restriction operator $\rho_n$ from Section 4.

Proposition 8 (Relations between the operators). Under the general
assumptions, the functions $d_n$ and $d$ are continuous, bounded from below
by the constant $l > 0$, and from above by $\|k\|_\infty$. All operators defined
above are bounded, and the integral operators are compact. The operator norms
of $M_{d_n}$, $M_d$, $S_n$ and $S$ are bounded by $\|k\|_\infty$, the ones of
$T'_n$, $T_n$ and $T$ by $\|k\|_\infty / l$. Moreover, we have the following:

\[ \tfrac{1}{n} D_n \circ \rho_n = \rho_n \circ M_{d_n}, \qquad \tfrac{1}{n} K_n \circ \rho_n = \rho_n \circ S_n, \]
\[ \tfrac{1}{n} L_n \circ \rho_n = \rho_n \circ U_n, \qquad L'_n \circ \rho_n = \rho_n \circ U'_n. \]

Proof. All statements follow directly from the definitions and the general
assumptions. Note that in the case of the unnormalized Laplacian $L_n$
we get the scaling factor $1/n$ from the $1/n$-factor hidden in the empirical
distribution $P_n$. In the case of the normalized Laplacian, this scaling factor
cancels with the scaling factors of the degree functions in the denominators. □

The main statement of this proposition is that if restricted to the sample
points, $U_n$ "behaves as" $\frac{1}{n} L_n$ and $U'_n$ as $L'_n$. Moreover, by the
law of large numbers, it is clear that for fixed $f \in C(\mathcal{X})$ and
$x \in \mathcal{X}$ the empirical quantities converge to the corresponding true
quantities, in particular, $U_n f(x) \to U f(x)$ and $U'_n f(x) \to U' f(x)$.
Proving stronger convergence statements will be the main part of Step 3.
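The relation $L'_n \circ \rho_n = \rho_n \circ U'_n$ can be checked numerically on a small sample. The following is only a sketch under assumptions that are not part of the paper: a one-dimensional sample drawn uniformly from $[0,1]$, a Gaussian similarity with a hypothetical width `sigma = 0.5`, and helper names (`make_operators`, `normalized_laplacian`) chosen purely for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    """Similarity k(x, y) = exp(-|x - y|^2 / sigma^2); broadcasts over arrays."""
    return np.exp(-(x - y) ** 2 / sigma ** 2)

def make_operators(sample, sigma=0.5):
    """Return the empirical degree d_n and a function applying U'_n = I - T'_n."""
    def d_n(x):
        # d_n(x) = (1/n) * sum_j k(x, X_j)
        x = np.atleast_1d(x)
        return gaussian_kernel(x[:, None], sample[None, :], sigma).mean(axis=1)

    def apply_U_prime_n(f, x):
        # U'_n f(x) = f(x) - (1/n) * sum_j h_n(x, X_j) f(X_j)
        x = np.atleast_1d(x)
        k = gaussian_kernel(x[:, None], sample[None, :], sigma)
        h_n = k / np.sqrt(d_n(x)[:, None] * d_n(sample)[None, :])
        return f(x) - (h_n * f(sample)[None, :]).mean(axis=1)

    return d_n, apply_U_prime_n

def normalized_laplacian(sample, sigma=0.5):
    """Matrix L'_n = I - D^{-1/2} K D^{-1/2} built on the sample."""
    K = gaussian_kernel(sample[:, None], sample[None, :], sigma)
    deg = K.sum(axis=1)
    return np.eye(len(sample)) - K / np.sqrt(np.outer(deg, deg))

rng = np.random.default_rng(0)
sample = rng.uniform(0.0, 1.0, size=50)
f = np.cos  # any continuous test function will do

_, apply_U = make_operators(sample)
lhs = apply_U(f, sample)                        # rho_n (U'_n f)
rhs = normalized_laplacian(sample) @ f(sample)  # L'_n (rho_n f)
max_gap = float(np.max(np.abs(lhs - rhs)))
print(max_gap)
```

The two sides agree up to floating-point error because the factor $1/n$ in $P_n$ cancels against the factors $n$ relating the matrix degrees to $d_n$, which is the cancellation mentioned in the proof.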

5.3. Step 2: Relations between the spectra. The following proposition
establishes the connections between the spectra of $L'_n$ and $U'_n$. We show
that $U'_n$ and $L'_n$ have more or less the same spectrum and that the
eigenfunctions $f$ of $U'_n$ and eigenvectors $v$ of $L'_n$ correspond to each
other by the relation $v = \rho_n f$.

Proposition 9 (Spectrum of $U'_n$).

1. If $f \in C(\mathcal{X})$ is an eigenfunction of $U'_n$ with the eigenvalue
$\lambda$, then the vector $v = \rho_n f \in \mathbb{R}^n$ is an eigenvector of the
matrix $L'_n$ with eigenvalue $\lambda$.

2. Let $\lambda \ne 1$ be an eigenvalue of $U'_n$ with eigenfunction
$f \in C(\mathcal{X})$, and $v := (v_1, \ldots, v_n) := \rho_n f \in \mathbb{R}^n$.
Then $f$ is of the form

\[ f(x) = \frac{\frac{1}{n} \sum_{j=1}^{n} h_n(x, X_j)\, v_j}{1 - \lambda}. \tag{1} \]

3. If $v$ is an eigenvector of the matrix $L'_n$ with eigenvalue
$\lambda \ne 1$, then $f$ defined by equation (1) is an eigenfunction of $U'_n$
with eigenvalue $\lambda$.

4. The spectrum of $U'_n$ consists of finitely many nonnegative eigenvalues
with finite multiplicity. The essential spectrum of $U'_n$ consists of at most
one point, namely, $\sigma_{\mathrm{ess}}(U'_n) = \{1\}$. The spectrum of $U'$
consists of at most countably many nonnegative eigenvalues with finite
multiplicity. Its essential spectrum consists at most of the point $\{1\}$,
which is also the only possible accumulation point in $\sigma(U')$.

Proof. Part 1: It is obvious from Proposition 8 that $U'_n f = \lambda f$
implies $L'_n v = \lambda v$. Note also that part 2 shows that $v$ is not the
constant 0 vector.

Part 2: Follows directly from solving the eigenvalue equation.

Part 3: Define $f$ as in equation (1). It is well defined because $v$ is an
eigenvector of $L'_n$ with eigenvalue $\lambda \ne 1$, and $f$ is an
eigenfunction of $U'_n$ with eigenvalue $\lambda$.

Part 4: According to Proposition 8, $T'_n$ is a compact integral operator, and
its essential spectrum is at most $\{0\}$. The spectrum $\sigma(U'_n)$ of
$U'_n = I - T'_n$ is given by $1 - \sigma(T'_n)$. The statements about the
eigenvalues of $U'_n$ follow from the properties of the eigenvalues of $L'_n$
and parts 1-3 of the proposition. An analogous reasoning leads to the
statements for $U'$. □

This proposition establishes a one-to-one correspondence between the
eigenvalues and eigenvectors of $L'_n$ and $U'_n$, provided they satisfy
$\lambda \ne 1$. The condition $\lambda \ne 1$ is needed to ensure that the
denominator of equation (1) does not vanish. As a side remark, note that the
set $\{1\}$ is the essential spectrum of $U'_n$. Thus, the condition
$\lambda \ne 1$ can also be written as $\lambda \notin \sigma_{\mathrm{ess}}(U'_n)$,
which will be analogous to the condition on the eigenvalues in the
unnormalized case. This condition ensures that $\lambda$ is isolated in the
spectrum.
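The correspondence of Proposition 9 can be illustrated by extending an eigenvector of $L'_n$ to a function on the whole space through equation (1), as solved from the eigenvalue equation $(1 - \lambda) f = T'_n f$. The sketch below is not the authors' code; the uniform sample on $[0,1]$, the Gaussian similarity with width 0.5 and the function name `extend` are assumptions made for illustration.

```python
import numpy as np

def kernel(x, y, sigma=0.5):
    """Gaussian similarity; the width is a hypothetical choice."""
    return np.exp(-(x - y) ** 2 / sigma ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=40)
n = len(X)

K = kernel(X[:, None], X[None, :])
d_sample = K.mean(axis=1)  # d_n(X_i)
# L'_n = I - D^{-1/2} K D^{-1/2}, written via d_n (matrix degrees are n * d_n)
L_norm = np.eye(n) - K / (n * np.sqrt(np.outer(d_sample, d_sample)))

eigvals, eigvecs = np.linalg.eigh(L_norm)  # eigenvalues in ascending order
lam, v = eigvals[1], eigvecs[:, 1]         # second eigenpair, lam != 1

def extend(x):
    """Equation (1): f(x) = (1/n) * sum_j h_n(x, X_j) v_j / (1 - lam)."""
    x = np.atleast_1d(x)
    k = kernel(x[:, None], X[None, :])
    d_x = k.mean(axis=1)
    h_n = k / np.sqrt(d_x[:, None] * d_sample[None, :])
    return (h_n @ v) / (n * (1.0 - lam))

# Restricting the extended function to the sample recovers v = rho_n f.
restriction_gap = float(np.max(np.abs(extend(X) - v)))
print(lam, restriction_gap)
```

Calling `extend` on points outside the sample evaluates the eigenfunction of $U'_n$ there, which is what makes a comparison with a limit function on $\mathcal{X}$ possible.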

5.4. Step 3: Compact convergence. In this section we want to prove that
the sequence of random operators $U'_n$ converges compactly to $U'$ almost
surely. First we will prove pointwise convergence. Note that on the space
$C(\mathcal{X})$, the pointwise convergence of a sequence $U'_n$ of operators is
defined as $\|U'_n f - U' f\|_\infty \to 0$, that is, for each
$f \in C(\mathcal{X})$, the sequence $(U'_n f)_n$ has to converge uniformly over
$\mathcal{X}$. To establish this convergence, we will need to show that several
classes of functions are "not too large," that is, they are Glivenko-Cantelli
classes. For convenience, we introduce the following notation:

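Definition 10 (Particular sets of functions). Let
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a similarity function,
$h : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ the corresponding normalized
similarity function as introduced above and $g \in C(\mathcal{X})$ an arbitrary
function. We use the shorthand notation $k(x,\cdot)$, $g(\cdot)h(x,\cdot)$ and
$h(x,\cdot)h(y,\cdot)$ to denote the functions $z \mapsto k(x,z)$,
$z \mapsto g(z)h(x,z)$ and $z \mapsto h(x,z)h(y,z)$. We define the following:

\[ \mathcal{K} := \{k(x,\cdot);\ x \in \mathcal{X}\}, \qquad \mathcal{H} := \{h(x,\cdot);\ x \in \mathcal{X}\}, \]
\[ g \cdot \mathcal{H} := \{g(\cdot)h(x,\cdot);\ x \in \mathcal{X}\}, \qquad \mathcal{H} \cdot \mathcal{H} := \{h(x,\cdot)h(y,\cdot);\ x, y \in \mathcal{X}\}. \]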

Proposition 11 (Glivenko-Cantelli classes). Under the general assumptions,
the classes $\mathcal{K}$, $\mathcal{H}$ and $g \cdot \mathcal{H}$ [for arbitrary
$g \in C(\mathcal{X})$] are Glivenko-Cantelli classes.

Proof. As $k$ is a continuous function defined on a compact domain, it
is uniformly continuous. In this case it is easy to construct, for each
$\varepsilon > 0$, a finite $\varepsilon$-cover with respect to $\|\cdot\|_\infty$ of
$\mathcal{K}$ from a finite $\delta$-cover of $\mathcal{X}$. Hence, $\mathcal{K}$ has
finite $\|\cdot\|_\infty$-covering numbers. Then it is easy to see that
$\mathcal{K}$ also has finite $\|\cdot\|_{L_1(P)}$-bracketing numbers (cf. van der
Vaart and Wellner [45], page 84). Now the statement about the class
$\mathcal{K}$ follows from Theorem 2.4.1 of van der Vaart and Wellner [45]. The
statements about the classes $\mathcal{H}$ and $g \cdot \mathcal{H}$ can be proved in
the same way, hereby observing that $h$ is continuous and bounded as a
consequence of the general assumptions. □

Note that it is a direct consequence of this proposition that the empirical
degree function $d_n$ converges uniformly to the true degree function $d$, that
is,

\[ \|d_n - d\|_\infty = \sup_{x \in \mathcal{X}} |d_n(x) - d(x)| = \sup_{x \in \mathcal{X}} |P_n k(x,\cdot) - P k(x,\cdot)| \to 0 \quad \text{a.s.} \]
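The uniform convergence of $d_n$ to $d$ can be observed in a small simulation. This is a sketch under assumptions not taken from the paper: $P$ is the uniform distribution on $[0,1]$, the similarity is Gaussian with a hypothetical width `SIGMA = 0.5` (so that $d$ has a closed form in terms of the error function), and the supremum over $\mathcal{X}$ is replaced by a maximum over a finite grid.

```python
import math
import numpy as np

SIGMA = 0.5  # hypothetical kernel width

def true_degree(x):
    """d(x) = int_0^1 exp(-(x - y)^2 / SIGMA^2) dy for P = Uniform[0, 1]."""
    c = SIGMA * math.sqrt(math.pi) / 2.0
    return np.array([c * (math.erf((1.0 - t) / SIGMA) + math.erf(t / SIGMA))
                     for t in x])

def empirical_degree(x, sample):
    """d_n(x) = (1/n) * sum_j k(x, X_j)."""
    return np.exp(-(x[:, None] - sample[None, :]) ** 2 / SIGMA ** 2).mean(axis=1)

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 201)  # finite grid standing in for the sup over X
sup_errors = {}
for n in (100, 1000, 10000):
    sample = rng.uniform(0.0, 1.0, size=n)
    gap = np.abs(empirical_degree(grid, sample) - true_degree(grid))
    sup_errors[n] = float(gap.max())

for n, err in sup_errors.items():
    print(n, err)
```

A single run only shows one realization; the almost sure statement in the text concerns the whole sequence, so the printed errors should be read as an illustration rather than as evidence of a rate.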



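Proposition 12 ($T'_n$ converges pointwise to $T$ a.s.). Under the general
assumptions, $T'_n \xrightarrow{p} T$ almost surely.

Proof. For arbitrary $f \in C(\mathcal{X})$, we have

\[ \|T'_n f - T f\|_\infty \le \|T'_n f - T_n f\|_\infty + \|T_n f - T f\|_\infty. \]

The second term can be written as

\[ \|T_n f - T f\|_\infty = \sup_{x \in \mathcal{X}} |P_n(h(x,\cdot) f(\cdot)) - P(h(x,\cdot) f(\cdot))| = \sup_{g \in f \cdot \mathcal{H}} |P_n g - P g|, \]

which converges to 0 a.s. by Proposition 11. The first term can be bounded
by

\[
\begin{aligned}
\|T'_n f - T_n f\|_\infty
 &\le \|f\|_\infty \|k\|_\infty \sup_{x,y \in \mathcal{X}} \left| \frac{1}{\sqrt{d_n(x) d_n(y)}} - \frac{1}{\sqrt{d(x) d(y)}} \right| \\
 &\le \|f\|_\infty \frac{\|k\|_\infty}{l^2} \sup_{x,y \in \mathcal{X}} \frac{|d_n(x) d_n(y) - d(x) d(y)|}{\sqrt{d_n(x) d_n(y)} + \sqrt{d(x) d(y)}} \\
 &\le \|f\|_\infty \frac{\|k\|_\infty}{2 l^3} \sup_{x,y \in \mathcal{X}} |d_n(x) d_n(y) - d(x) d(y)| \\
 &\le \|f\|_\infty \frac{\|k\|_\infty^2}{l^3} \sup_{x \in \mathcal{X}} |d_n(x) - d(x)|
 \le \|f\|_\infty \frac{\|k\|_\infty^2}{l^3} \sup_{g \in \mathcal{K}} |P_n g - P g|.
\end{aligned}
\]

Together with Proposition 11 this finishes the proof. □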
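For readers who want the intermediate algebra, the bound on the first term rests on two elementary estimates. The following worked step is a sketch that uses only the bounds $l \le d_n, d \le \|k\|_\infty$ from Proposition 8; the abbreviations $a$ and $b$ are introduced here for illustration and do not appear in the paper.

```latex
% Abbreviations (not in the original): a := d_n(x) d_n(y), b := d(x) d(y), so a, b >= l^2.
\[
\left| \frac{1}{\sqrt{a}} - \frac{1}{\sqrt{b}} \right|
  = \frac{|\sqrt{b} - \sqrt{a}|}{\sqrt{a}\sqrt{b}}
  = \frac{|a - b|}{\sqrt{a}\sqrt{b}\,(\sqrt{a} + \sqrt{b})}
  \le \frac{1}{l^2} \cdot \frac{|a - b|}{\sqrt{a} + \sqrt{b}}
  \le \frac{|a - b|}{2 l^3},
\]
% Splitting the difference of products and using d_n, d <= ||k||_infty:
\[
|d_n(x) d_n(y) - d(x) d(y)|
  \le d_n(x)\,|d_n(y) - d(y)| + d(y)\,|d_n(x) - d(x)|
  \le 2 \|k\|_\infty \|d_n - d\|_\infty .
\]
```

Multiplying the two estimates and the prefactor $\|f\|_\infty \|k\|_\infty$ gives a bound of the form $\mathrm{const} \cdot \|f\|_\infty \|d_n - d\|_\infty$, so the first term is controlled by the Glivenko-Cantelli property of $\mathcal{K}$.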

Proposition 13 ($T'_n$ converges collectively compactly to $T$ a.s.). Under
the general assumptions, $T'_n \xrightarrow{cc} T$ almost surely.

Proof. We have already seen the pointwise convergence
$T'_n \xrightarrow{p} T$ in Proposition 12. Next we have to prove that, for some
$N \in \mathbb{N}$, the sequence of operators $(T'_n - T)_{n>N}$ is collectively
compact a.s. As $T$ is compact itself, it is enough to show that
$(T'_n)_{n>N}$ is collectively compact a.s. This will be done using the
Arzela-Ascoli theorem (e.g., Section I.6 of Reed and Simon [41]). First we fix
the random sequence $(X_n)_n$ and, hence, the random operators $(T'_n)_n$. By
Proposition 8, we know that $\|T'_n\| \le \|k\|_\infty / l$ for all
$n \in \mathbb{N}$. Hence, the functions in $\bigcup_n T'_n B$ are uniformly bounded
by $\sup_{n \in \mathbb{N}, f \in B} \|T'_n f\|_\infty \le \|k\|_\infty / l$. To prove
that the functions in $\bigcup_{n>N} T'_n B$ are equicontinuous, we have to bound
the expression $|g(x) - g(x')|$ in terms of the distance between $x$ and $x'$,
uniformly in $g \in \bigcup_n T'_n B$. For fixed sequence
$(X_n)_{n \in \mathbb{N}}$ and all $n \in \mathbb{N}$, we have that for all
$x, x' \in \mathcal{X}$,

\[
\begin{aligned}
\sup_{f \in B,\, n \in \mathbb{N}} |T'_n f(x) - T'_n f(x')|
 &= \sup_{f \in B,\, n \in \mathbb{N}} \left| \int (h_n(x,y) - h_n(x',y)) f(y)\, dP_n(y) \right| \\
 &\le \sup_{f \in B,\, n \in \mathbb{N}} \|f\|_\infty \int |h_n(x,y) - h_n(x',y)|\, dP_n(y) \\
 &\le \sup_{n \in \mathbb{N}} \|h_n(x,\cdot) - h_n(x',\cdot)\|_\infty.
\end{aligned}
\]

Now we have to prove that the right-hand side gets small whenever the
distance between $x$ and $x'$ gets small:

\[
\begin{aligned}
\sup_y |h_n(x,y) - h_n(x',y)|
 &\le \frac{1}{l^{3/2}} \left( \|\sqrt{d_n}\|_\infty \|k(x,\cdot) - k(x',\cdot)\|_\infty + \|k\|_\infty \left|\sqrt{d_n(x)} - \sqrt{d_n(x')}\right| \right) \\
 &\le \frac{1}{l^{3/2}} \left( \|k\|_\infty^{1/2} \|k(x,\cdot) - k(x',\cdot)\|_\infty + \frac{\|k\|_\infty}{2 l^{1/2}} |d_n(x) - d_n(x')| \right) \\
 &\le C_1 \|k(x,\cdot) - k(x',\cdot)\|_\infty + C_2 |d(x) - d(x')| + C_3 \|d_n - d\|_\infty.
\end{aligned}
\]

As $\mathcal{X}$ is a compact space, the continuous functions $k$ (on the compact
space $\mathcal{X} \times \mathcal{X}$) and $d$ are in fact uniformly continuous.
Thus, the first two (deterministic) terms $\|k(x,\cdot) - k(x',\cdot)\|_\infty$ and
$|d(x) - d(x')|$ can be made arbitrarily small for all $x, x'$ whenever the
distance between $x$ and $x'$ is small. For the third term
$\|d_n - d\|_\infty$, which is a random term, we know by the Glivenko-Cantelli
properties of Proposition 11 that it converges to 0 a.s. This means that for
each given $\varepsilon > 0$ there exists some $N \in \mathbb{N}$ such that, for all
$n > N$, we have $\|d_n - d\|_\infty \le \varepsilon$ a.s. Together, these arguments
show that $\bigcup_{n>N} T'_n B$ is equicontinuous a.s. By the Arzela-Ascoli
theorem, we then know that $\bigcup_{n>N} T'_n B$ is relatively compact a.s.,
which concludes the proof. □

Proposition 14 ($U'_n$ converges compactly to $U'$ a.s.). Under the general
assumptions, $U'_n \xrightarrow{c} U'$ a.s.

Proof. This follows directly from the fact that collectively compact
convergence implies compact convergence, the definitions of $U'_n$ and $U'$,
and Proposition 13. □

5.5. Assembling all pieces. Now we have collected all ingredients to state
and prove our convergence result for normalized spectral clustering. The
following theorem is the precisely formulated version of the informal Result
1 of the introduction:

Theorem 15 (Convergence of normalized spectral clustering). Assume
that the general assumptions hold. Let $\lambda \ne 1$ be an eigenvalue of $U'$
and $M \subset \mathbb{C}$ an open neighborhood of $\lambda$ such that
$\sigma(U') \cap M = \{\lambda\}$. Then:

1. Convergence of eigenvalues: The eigenvalues in $\sigma(L'_n) \cap M$ converge
to $\lambda$ in the sense that every sequence $(\lambda_n)_{n \in \mathbb{N}}$ with
$\lambda_n \in \sigma(L'_n) \cap M$ satisfies $\lambda_n \to \lambda$ almost surely.

2. Convergence of spectral projections: There exists some $N \in \mathbb{N}$ such
that, for $n > N$, the sets $\sigma(U'_n) \cap M$ are isolated in $\sigma(U'_n)$.
For $n > N$, let $\mathrm{Pr}'_n$ be the spectral projections of $U'_n$
corresponding to $\sigma(U'_n) \cap M$, and $\mathrm{Pr}$ the spectral projection of
$U'$ for $\lambda$. Then $\mathrm{Pr}'_n \xrightarrow{p} \mathrm{Pr}$ a.s.

3. Convergence of eigenvectors: If $\lambda$ is a simple eigenvalue, then the
eigenvectors of $L'_n$ converge a.s. up to a change of sign: if $v_n$ is the
eigenvector of $L'_n$ with eigenvalue $\lambda_n$, $v_{n,i}$ its $i$th coordinate,
and $f$ the eigenfunction of eigenvalue $\lambda$, then there exists a sequence
$(a_n)_{n \in \mathbb{N}}$ with $a_n \in \{+1, -1\}$ such that
$\sup_{i=1,\ldots,n} |a_n v_{n,i} - f(X_i)| \to 0$ a.s. In particular, for all
$b \in \mathbb{R}$, the sets $\{a_n f_n > b\}$ and $\{f > b\}$ converge, that is,
their symmetric difference satisfies
$P(\{f > b\} \triangle \{a_n f_n > b\}) \to 0$.

Proof. In Proposition 9 we established a one-to-one correspondence
between the eigenvalues $\lambda \ne 1$ of $L'_n$ and $U'_n$, and we saw that the
eigenvalues $\lambda$ of $U'$ with $\lambda \ne 1$ are isolated and have finite
multiplicity. In Proposition 14 we proved the compact convergence of $U'_n$ to
$U'$, which according to Proposition 6 implies the convergence of the spectral
projections of isolated eigenvalues with finite multiplicity. For simple
eigenvalues, this implies the convergence of the eigenvectors up to a change
of sign. The convergence of the sets $\{f_n > b\}$ is a simple consequence of
the almost sure convergence of $(a_n f_n)_n$. □

Observe that we only get convergence of the eigenvectors if the eigenvalue 
of the limit operator is simple. If this assumption is not satisfied, we only 
get convergence of the eigenspaces, but not of the individual eigenvectors. 
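To see what the eigenvalue statement of Theorem 15 looks like in practice, one can track the two smallest eigenvalues of $L'_n$ for growing sample sizes. The sketch below rests on assumptions that are not in the paper: a hypothetical two-cluster distribution on the compact set $[-0.2, 0.2] \cup [0.8, 1.2]$, a Gaussian similarity with width 0.3, and the helper names `smallest_two_eigenvalues` and `draw_two_clusters`.

```python
import numpy as np

def smallest_two_eigenvalues(sample, sigma=0.3):
    """Two smallest eigenvalues of L'_n = I - D^{-1/2} K D^{-1/2}."""
    K = np.exp(-(sample[:, None] - sample[None, :]) ** 2 / sigma ** 2)
    deg = K.sum(axis=1)
    L_norm = np.eye(len(sample)) - K / np.sqrt(np.outer(deg, deg))
    eigvals = np.linalg.eigvalsh(L_norm)  # sorted in ascending order
    return float(eigvals[0]), float(eigvals[1])

def draw_two_clusters(rng, n):
    """Hypothetical test distribution: two uniform blocks around 0 and 1."""
    centers = rng.choice([0.0, 1.0], size=n)
    return centers + rng.uniform(-0.2, 0.2, size=n)

rng = np.random.default_rng(3)
results = {n: smallest_two_eigenvalues(draw_two_clusters(rng, n))
           for n in (100, 400, 1600)}
for n, (lam1, lam2) in results.items():
    print(n, lam1, lam2)
```

The smallest eigenvalue is the trivial eigenvalue 0 for every $n$. The second one is the quantity whose limit the theorem describes, provided the corresponding eigenvalue of $U'$ is different from 1; how quickly the printed values settle depends on the distribution and on the kernel width chosen here.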

6. Rates of convergence in the normalized case. In this section we want
to prove statements about the rates of convergence of normalized spectral
clustering. Our main result is the following:

Theorem 16 (Rate of convergence of normalized spectral clustering).
Under the general assumptions, let $\lambda \ne 0$ be a simple eigenvalue of $T$
with eigenfunction $u$, $(\lambda_n)_n$ a sequence of eigenvalues of $T'_n$ such
that $\lambda_n \to \lambda$, and $(u_n)_n$ a corresponding sequence of
eigenfunctions. Define
$\mathcal{F} = \mathcal{K} \cup (u \cdot \mathcal{H}) \cup (\mathcal{H} \cdot \mathcal{H})$.
Then there exists a constant $C' > 0$ [which only depends on the similarity
function $k$, on $\sigma(T)$ and on $\lambda$] and a sequence $(a_n)_n$ of signs
$a_n \in \{+1, -1\}$ such that

\[ \|a_n u_n - u\|_\infty \le C' \sup_{f \in \mathcal{F}} |P_n f - P f|. \]

This theorem shows that the rate of convergence of normalized spectral
clustering is at least as good as the rate of convergence of the supremum of
the empirical process indexed by $\mathcal{F}$. To determine the latter, there
exist a variety of tools and techniques from the theory of empirical
processes, such as covering numbers, VC dimension and Rademacher complexities;
see, for example, van der Vaart and Wellner [45], Dudley [18], Mendelson [34]
and Pollard [39]. In particular, it is the case that "the nicer" the kernel
function $k$ is (e.g., $k$ is Lipschitz, or smooth, or positive definite), the
faster the rate of convergence on the right-hand side will be. As an example
we will consider the case of the Gaussian similarity function
$k(x,y) = \exp(-\|x - y\|^2 / \sigma^2)$, which is widely used in practical
applications of spectral clustering.

Example 1 (Rate of convergence for Gaussian kernel). Let $\mathcal{X}$ be a
compact subset of $\mathbb{R}^d$ and $k(x,y) = \exp(-\|x - y\|^2 / \sigma^2)$. Then
the eigenvectors in Theorem 16 converge with rate $O(1/\sqrt{n})$.

For the case of unnormalized spectral clustering, it is possible to obtain 
similar results on the speed of convergence, for example, by using Propo- 
sition 5.3 in Chapter 5 of Chatelin [13] instead of the results of Atkinson 
[5] (note that in the unnormalized case, the assumptions of Theorem 7 are 
not satisfied, as we only have compact convergence instead of collectively 
compact convergence). As we recommend to use normalized rather than 
unnormalized spectral clustering anyway, we do not discuss this issue any 
further. The remaining part of this section is devoted to the proofs of The- 
orem 16 and Example 1. 

6.1. Some technical preparations. Before we can prove Theorem 16 we
need to show several technical propositions.

Proposition 17 (Some technical bounds). Assume that the general
conditions are satisfied, and let $g \in C(\mathcal{X})$. Then the following
bounds hold true:

\[ \|T_n - T'_n\| \le \frac{\|k\|_\infty^2}{l^3} \sup_{f \in \mathcal{K}} |P_n f - P f|, \]
\[ \|(T_n - T) g\|_\infty \le \sup_{f \in g \cdot \mathcal{H}} |P_n f - P f|, \]
\[ \|(T - T_n) T_n\| \le \sup_{f \in \mathcal{H} \cdot \mathcal{H}} |P_n f - P f|. \]

Proof. The first inequality can be proved similarly to Proposition 12,
the second inequality is a direct consequence of the definitions. The third
inequality follows by straightforward calculations similar to the ones in the
previous section and using Fubini's theorem and the symmetry of $h$. □

Proposition 18 (Convergence of one-dimensional projections). Let
$(v_n)_n$ be a sequence of vectors in some Banach space $(E, \|\cdot\|)$ with
$\|v_n\| = 1$, $\mathrm{Pr}_n$ the projections on the one-dimensional subspace
spanned by $v_n$, and $v \in E$ with $\|v\| = 1$. Then there exists a sequence
$(a_n)_n \in \{+1, -1\}$ of signs such that

\[ \|a_n v_n - v\| \le 2 \|v - \mathrm{Pr}_n v\|. \]

In particular, if $\|v - \mathrm{Pr}_n v\| \to 0$, then $v_n$ converges to $v$ up to
a change of sign.

Proof. By the definition of $\mathrm{Pr}_n$, we know that
$\mathrm{Pr}_n v = c_n v_n$ for some $c_n \in \mathbb{R}$. Define
$a_n := \mathrm{sgn}(c_n)$. Then

\[ |a_n - c_n| = |1 - |c_n|| = \big| \|v\| - |c_n| \cdot \|v_n\| \big| \le \|v - c_n v_n\| = \|v - \mathrm{Pr}_n v\|. \]

From this, we can conclude that

\[ \|v - a_n v_n\| \le \|v - c_n v_n\| + \|c_n v_n - a_n v_n\| \le 2 \|v - \mathrm{Pr}_n v\|. \qquad \Box \]
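The sign-alignment bound of Proposition 18 can be checked on random unit vectors. This is only an illustrative sketch: it works in Euclidean space with the orthogonal projection, which is one special case of the Banach space setting of the proposition, and the function name `sign_aligned_gap` is hypothetical.

```python
import numpy as np

def sign_aligned_gap(v, v_n):
    """Return (||a_n v_n - v||, 2 ||v - Pr_n v||) for Euclidean unit vectors."""
    c_n = float(v @ v_n)             # orthogonal projection: Pr_n v = c_n v_n
    a_n = 1.0 if c_n >= 0 else -1.0  # a_n = sgn(c_n); ties are sent to +1
    lhs = float(np.linalg.norm(a_n * v_n - v))
    rhs = 2.0 * float(np.linalg.norm(v - c_n * v_n))
    return lhs, rhs

rng = np.random.default_rng(4)
checks = []
for _ in range(1000):
    v = rng.standard_normal(5)
    v /= np.linalg.norm(v)
    w = rng.standard_normal(5)
    w /= np.linalg.norm(w)
    lhs, rhs = sign_aligned_gap(v, w)
    checks.append(lhs <= rhs + 1e-12)  # small slack for rounding

all_hold = all(checks)
print(all_hold)
```

In applications one would choose the sign in the same way, by comparing the computed eigenvector with a reference vector, before measuring the distance between them.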

6.2. Proof of Theorem 16. First we fix a realization of the random
variables $(X_n)_n$. From the convergence of the spectral projections in
Theorem 15 we know that if $\lambda \in \sigma(T)$ is simple, so are
$\lambda_n \in \sigma(T'_n)$ for large $n$. Then the eigenfunctions $u_n$ are
uniquely determined up to a change of orientation. In Proposition 18 we have
seen that the speed of convergence of $u_n$ to $u$ is bounded by the speed of
convergence of the expression $\|u - \mathrm{Pr}_n u\|$ from Theorem 7. As we
already know by Section 5, the operators $T'_n$ and $T$ satisfy the assumptions
in Theorem 7. Accordingly, $\|u - \mathrm{Pr}_n u\|$ can be bounded by the two
terms $\|(T'_n - T) u\|$ and $\|(T - T'_n) T'_n\|$. It will turn out
that both terms are easier to bound if we can replace the operator $T'_n$ by
$T_n$. To accomplish this, observe that

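\[
\begin{aligned}
\|(T - T'_n) T'_n\|
 &\le \|T\| \|T'_n - T_n\| + \|(T - T_n) T_n\| + \|T_n T_n - T_n T'_n\| + \|T_n T'_n - T'_n T'_n\| \\
 &\le 3 \frac{\|k\|_\infty}{l} \|T_n - T'_n\| + \|(T - T_n) T_n\|
\end{aligned}
\]

and also

\[ \|(T'_n - T) u\|_\infty \le \|u\|_\infty \|T'_n - T_n\| + \|(T_n - T) u\|_\infty. \]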

Note that $T_n$ does not converge to $T$ in operator norm (cf. page 197 in
Section 4.7.4 of Chatelin [13]). Thus, it does not make sense to bound
$\|(T_n - T) u\|_\infty$ by $\|T_n - T\| \|u\|_\infty$ or $\|(T - T_n) T_n\|$ by
$\|T - T_n\| \|T_n\|$. Assembling all inequalities, applying Proposition 18 and
Theorem 7, and choosing the signs $a_n$ as in the proof of Proposition 18, we
obtain

\[
\begin{aligned}
\|a_n u_n - u\|
 &\le 2 \|u - \mathrm{Pr}_n u\| \le 2C \big( \|(T'_n - T) u\| + \|(T - T'_n) T'_n\| \big) \\
 &\le 2C \left( \left( 3 \frac{\|k\|_\infty}{l} + \|u\|_\infty \right) \|T'_n - T_n\| + \|(T_n - T) u\|_\infty + \|(T - T_n) T_n\| \right) \\
 &\le C' \sup_{f \in \mathcal{K} \cup (u \cdot \mathcal{H}) \cup (\mathcal{H} \cdot \mathcal{H})} |P_n f - P f|.
\end{aligned}
\]

Here the last step was obtained by applying Proposition 17 and merging all
occurring constants into one larger constant $C'$. As all arguments hold for
each fixed realization $(X_n)_n$ of the sample points, they also hold for the
random variables themselves almost surely. This concludes the proof of
Theorem 16.

6.3. Rate of convergence for the Gaussian kernel. In this subsection we
want to prove the convergence rate $O(1/\sqrt{n})$ stated in Example 1 for the
case of a Gaussian kernel function $k(x,y) = \exp(-\|x - y\|^2 / \sigma^2)$. In
principle, there are many ways to compute rates of convergence for terms of
the form $\sup_{f \in \mathcal{F}} |P f - P_n f|$ (see, e.g., van der Vaart and
Wellner [45]). As discussing those methods is not the main focus of our paper,
we choose a rather simple covering number approach which suffices for our
purposes, but might not lead to the sharpest possible bounds. We will use the
following theorem, which is well known in empirical process theory
(nevertheless, we did not find a good reference for it; it can be obtained,
for example, by combining Section 3.4 of Anthony [4] and Theorem 2.34 in
Mendelson [34]):

Theorem 19 (Entropy bound). Let $(\mathcal{X}, \mathcal{A}, P)$ be an arbitrary
probability space, $\mathcal{F}$ a class of real-valued functions on $\mathcal{X}$
with $\|f\|_\infty \le 1$. Let $(X_n)_{n \in \mathbb{N}}$ be a sequence of i.i.d.
random variables drawn according to $P$, and $(P_n)_{n \in \mathbb{N}}$ the
corresponding empirical distributions. Then there exists some constant
$c > 0$ such that, for all $n \in \mathbb{N}$ with probability at least
$1 - \delta$,

\[ \sup_{f \in \mathcal{F}} |P_n f - P f| \le \frac{c}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\mathcal{F}, \varepsilon, L_2(P_n))}\, d\varepsilon + \sqrt{\frac{1}{2n} \log \frac{2}{\delta}}. \]



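We can see that if
$\int_0^\infty \sqrt{\log N(\mathcal{F}, \varepsilon, L_2(P_n))}\, d\varepsilon < \infty$,
then the whole expression scales as $O(1/\sqrt{n})$. As a first step we would
like to evaluate this integral for the function class
$\mathcal{F} := \mathcal{K}$. To this end, we use bounds on the
$\|\cdot\|_\infty$-covering numbers of $\mathcal{K}$ obtained in Proposition 1 in
[51]. There it was proved that for $\varepsilon < c_0$ for a certain constant
$c_0 > 0$ only depending on the kernel width $\sigma$, and for some constant $C$
which just depends on the dimension of the underlying space, the covering
numbers satisfy

\[ \log N(\mathcal{K}, \varepsilon, \|\cdot\|_\infty) \le C \left( \log \frac{1}{\varepsilon} \right)^2. \]

Plugging this into the integral above, we get

\[
\begin{aligned}
\int_0^\infty \sqrt{\log N(\mathcal{K}, \varepsilon, L_2(P_n))}\, d\varepsilon
 &\le \int_0^2 \sqrt{\log N(\mathcal{K}, \varepsilon, \|\cdot\|_\infty)}\, d\varepsilon \\
 &\le \sqrt{C} \int_0^{c_0} \log \frac{1}{\varepsilon}\, d\varepsilon + \int_{c_0}^2 \sqrt{\log N(\mathcal{K}, \varepsilon, \|\cdot\|_\infty)}\, d\varepsilon \\
 &\le \sqrt{C}\, c_0 (1 - \log c_0) + (2 - c_0) \sqrt{\log N(\mathcal{K}, c_0, \|\cdot\|_\infty)} < \infty.
\end{aligned}
\]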
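The only explicit integral in the bound above is $\int_0^{c_0} \log(1/\varepsilon)\, d\varepsilon = c_0 (1 - \log c_0)$, which follows from the antiderivative $\varepsilon - \varepsilon \log \varepsilon$. A quick numerical cross-check is sketched below; the value `c0 = 0.5` is a hypothetical choice (the text only requires some $c_0 > 0$), and the midpoint rule is used because the integrand is unbounded at 0.

```python
import math
import numpy as np

def log_integral_midpoint(c0, cells=200_000):
    """Midpoint-rule approximation of int_0^{c0} log(1/eps) d eps."""
    h = c0 / cells
    mid = (np.arange(cells) + 0.5) * h  # cell midpoints, never exactly 0
    return float(np.sum(np.log(1.0 / mid)) * h)

c0 = 0.5  # hypothetical value, for illustration only
numeric = log_integral_midpoint(c0)
closed_form = c0 * (1.0 - math.log(c0))
print(numeric, closed_form)
```

The logarithmic singularity at 0 is integrable, which is why the entropy integral stays finite even though the covering numbers grow without bound as $\varepsilon \to 0$.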

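According to Theorem 16, we have to use the entropy bound not only for
the function class $\mathcal{F} = \mathcal{K}$, but for the class
$\mathcal{F} = \mathcal{K} \cup (u \cdot \mathcal{H}) \cup (\mathcal{H} \cdot \mathcal{H})$.
To this end, we will bound the $\|\cdot\|_\infty$-covering numbers of
$\mathcal{K} \cup (u \cdot \mathcal{H}) \cup (\mathcal{H} \cdot \mathcal{H})$ in terms
of the covering numbers of $\mathcal{K}$.

Proposition 20 (Covering numbers). Under the general assumptions,
the following covering number bounds hold true:

\[ N(\mathcal{H}, \varepsilon, \|\cdot\|_\infty) \le N(\mathcal{K}, s\varepsilon, \|\cdot\|_\infty), \]
\[ N(\mathcal{K} \cup (u \cdot \mathcal{H}) \cup (\mathcal{H} \cdot \mathcal{H}), \varepsilon, \|\cdot\|_\infty) \le 3 N(\mathcal{K}, q\varepsilon, \|\cdot\|_\infty), \]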

where s = — °° „,Y —, q := min{l, ||u||oog, " | s} and u G C{X) arbi- 
trary. 

This can be proved by straightforward calculations similar to the ones
presented in the previous sections.

Combining this proposition with the integral bound for the Gaussian kernel
as computed above, we obtain

\[ \int_0^\infty \sqrt{\log N(\mathcal{F}, \varepsilon, L_2(P_n))}\, d\varepsilon \le \int_0^2 \sqrt{\log 3 N(\mathcal{K}, q\varepsilon, \|\cdot\|_\infty)}\, d\varepsilon < \infty. \]

The entropy bound in Theorem 19 hence shows that the rate of convergence
of $\sup_{f \in \mathcal{F}} |P_n f - P f|$ is $O(1/\sqrt{n})$, and by Theorem 16,
the same now holds for the eigenfunctions of normalized spectral clustering.

7. The unnormalized case. Now we want to turn our attention to the 
case of unnormalized spectral clustering. It will turn out that this case is 
not as nice as the normalized case, as the convergence results will hold 
under strong conditions only. Moreover, those conditions are often violated in 
practice. In this case, the eigenvectors do not contain any useful information 
about the clustering of the data space. 

7.1. Convergence of unnormalized spectral clustering. The main theorem
about convergence of unnormalized spectral clustering (which was informally
stated as Result 2 in Section 3) is as follows:

Theorem 21 (Convergence of unnormalized spectral clustering). Assume
that the general assumptions hold. Let $\lambda \notin \mathrm{rg}(d)$ be an
eigenvalue of $U$ and $M \subset \mathbb{C}$ an open neighborhood of $\lambda$ such
that $\sigma(U) \cap M = \{\lambda\}$. Then:

1. Convergence of eigenvalues: The eigenvalues in
$\sigma(\frac{1}{n} L_n) \cap M$ converge to $\lambda$ in the sense that every
sequence $(\lambda_n)_{n \in \mathbb{N}}$ with
$\lambda_n \in \sigma(\frac{1}{n} L_n) \cap M$ satisfies $\lambda_n \to \lambda$
almost surely.

2. Convergence of spectral projections: There exists some $N \in \mathbb{N}$ such
that, for $n > N$, the sets $\sigma(U_n) \cap M$ are isolated in $\sigma(U_n)$.
For $n > N$, let $\mathrm{Pr}_n$ be the spectral projections of $U_n$
corresponding to $\sigma(U_n) \cap M$, and $\mathrm{Pr}$ the spectral projection of
$U$ for $\lambda$. Then $\mathrm{Pr}_n \xrightarrow{p} \mathrm{Pr}$ a.s.

3. Convergence of eigenvectors: If $\lambda$ is a simple eigenvalue, then the
eigenvectors of $\frac{1}{n} L_n$ converge a.s. up to a change of sign: if $v_n$
is the eigenvector of $\frac{1}{n} L_n$ with eigenvalue $\lambda_n$, $v_{n,i}$ its
$i$th coordinate, and $f$ the eigenfunction of $U$ with eigenvalue $\lambda$,
then there exists a sequence $(a_n)_{n \in \mathbb{N}}$ with
$a_n \in \{+1, -1\}$ such that
$\sup_{i=1,\ldots,n} |a_n v_{n,i} - f(X_i)| \to 0$ a.s. In particular, for all
$b \in \mathbb{R}$, the sets $\{a_n f_n > b\}$ and $\{f > b\}$ converge, that is,
their symmetric difference satisfies
$P(\{f > b\} \triangle \{a_n f_n > b\}) \to 0$.

This theorem looks very similar to Theorem 15. The only difference is
that the condition $\lambda \ne 1$ of Theorem 15 is now replaced by
$\lambda \notin \mathrm{rg}(d)$. Note that in both cases, those conditions are
equivalent to saying that $\lambda$ must be an isolated eigenvalue. In the
normalized case, this is satisfied for all eigenvalues but $\lambda = 1$, as
$U' = I - T$ where $T$ is a compact operator. In the unnormalized case,
however, this condition can be violated, as the spectrum of $U$ contains a
large continuous spectrum. Later we will see that this indeed leads to
serious problems.

The proof of Theorem 21 is very similar to the one we presented in Section
5. The main difference between both cases is the structure of the spectra of
$U_n$ and $U$. The proposition corresponding to Proposition 9 is the following:

Proposition 22 (Spectrum of $U_n$).

1. If $f \in C(\mathcal{X})$ is an eigenfunction of $U_n$ with arbitrary
eigenvalue $\lambda$, then the vector $v = \rho_n f \in \mathbb{R}^n$ is an
eigenvector of the matrix $\frac{1}{n} L_n$ with eigenvalue $\lambda$.

2. Let $\lambda \notin \mathrm{rg}(d_n)$ be an eigenvalue of $U_n$ with
eigenfunction $f \in C(\mathcal{X})$, and
$v := (v_1, \ldots, v_n) := \rho_n f \in \mathbb{R}^n$. Then $f$ is of the form

\[ f(x) = \frac{\frac{1}{n} \sum_{j=1}^{n} k(x, X_j)\, v_j}{d_n(x) - \lambda}. \tag{2} \]

3. If $v$ is an eigenvector of the matrix $\frac{1}{n} L_n$ with eigenvalue
$\lambda \notin \mathrm{rg}(d_n)$, then $f$ defined by equation (2) is an
eigenfunction of $U_n$ with eigenvalue $\lambda$.

4. The essential spectrum of $U_n$ coincides with the range of the degree
function, that is, $\sigma_{\mathrm{ess}}(U_n) = \mathrm{rg}(d_n)$. All eigenvalues
of $U_n$ are nonnegative and can have accumulation points only in
$\mathrm{rg}(d_n)$. The analogous statements also hold for the operator $U$.

Proof. The first parts can be proved analogously to Proposition 9. For
the last part, remember that the essential spectrum of the multiplication
operator $M_{d_n}$ consists of the range of the multiplier function $d_n$. As
$S_n$ is a compact operator, the essential spectrum of $U_n = M_{d_n} - S_n$
coincides with the essential spectrum of $M_{d_n}$, as we have already
mentioned in the beginning of Section 4. The accumulation points of the
spectrum of a bounded operator always belong to the essential spectrum.
Finally, to see the nonnegativity of the eigenvalues, observe that if we
consider the operator $U_n$ as an operator on $L_2(P_n)$ we have

\[
\begin{aligned}
\langle U_n f, f \rangle
 &= \int \int (f(x) - f(y)) f(x) k(x,y)\, dP_n(y)\, dP_n(x) \\
 &= \frac{1}{2} \int \int |f(x) - f(y)|^2 k(x,y)\, dP_n(y)\, dP_n(x) \ge 0.
\end{aligned}
\]

Thus, $U_n$ is a nonnegative operator on $L_2(P_n)$ and as such only has
nonnegative eigenvalues. As we have $C(\mathcal{X}) \subset L_2(P_n)$ by the
compactness of $\mathcal{X}$, the same holds for the eigenvalues of $U_n$ as an
operator on $C(\mathcal{X})$. □
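On the sample, the nonnegativity argument reduces to the familiar quadratic form identity for the unnormalized graph Laplacian. The sketch below checks it numerically; the uniform sample, the Gaussian similarity with width 0.5 and the random vector standing in for $\rho_n f$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.0, size=30)
n = len(X)

K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.25)  # Gaussian similarity, sigma = 0.5
L = np.diag(K.sum(axis=1)) - K                      # unnormalized Laplacian L_n = D_n - K_n

v = rng.standard_normal(n)  # plays the role of rho_n f for an arbitrary f

# <U_n f, f> in L_2(P_n) equals (1/n) sum_i (U_n f)(X_i) f(X_i),
# and rho_n U_n f = (1/n) L_n v, so the inner product is v^T L_n v / n^2.
inner = float(v @ (L @ v)) / n ** 2

# (1/2) int int |f(x) - f(y)|^2 k(x, y) dP_n(y) dP_n(x)
symmetric = 0.5 * float(np.sum(K * (v[:, None] - v[None, :]) ** 2)) / n ** 2

print(inner, symmetric)
```

Both expressions agree up to rounding and are nonnegative because every entry of the similarity matrix is nonnegative.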
This proposition establishes a one-to-one relationship between the
eigenvalues of $U_n$ and $\frac{1}{n} L_n$, provided the condition
$\lambda \notin \mathrm{rg}(d_n)$ is satisfied. Next we need to prove the compact
convergence of $U_n$ to $U$:

Proposition 23 ($U_n$ converges compactly to $U$ a.s.). Under the general
assumptions, $U_n \xrightarrow{c} U$ a.s.

Proof. We consider the multiplication and integral operator parts of
$U_n$ separately. Similarly to Proposition 13, we can prove that the integral
operators $S_n$ converge collectively compactly to $S$ a.s., and, as a
consequence, also $S_n \xrightarrow{c} S$ a.s. For the multiplication operators,
we have operator norm convergence

\[ \|M_{d_n} - M_d\| = \sup_{\|f\|_\infty \le 1} \|d_n f - d f\|_\infty \le \|d_n - d\|_\infty \to 0 \quad \text{a.s.} \]

by the Glivenko-Cantelli Proposition 11. As operator norm convergence implies
compact convergence, we also have $M_{d_n} \xrightarrow{c} M_d$ a.s. Finally, it is
easy to see that the sum of two compactly converging operators also converges
compactly. □

Now Theorem 21 follows by a proof similar to the one of Theorem 15.

8. Nonisolated eigenvalues. The most important difference between the
limit operators of normalized and unnormalized spectral clustering is the
condition under which eigenvalues of the limit operator are isolated in the
spectrum. In the normalized case this is true for all eigenvalues
$\lambda \ne 1$, while in the unnormalized case this is only true for all
eigenvalues satisfying $\lambda \notin \mathrm{rg}(d)$. In this section we want to
investigate those conditions more closely. We will see that, especially in the
unnormalized case, this condition can be violated, and that in this case
spectral clustering will not yield sensible results. In particular, the
condition $\lambda \notin \mathrm{rg}(d)$ is not an artifact of our methods, but
plays a fundamental role. It is the main reason why we suggest using
normalized rather than unnormalized spectral clustering.

8.1. Theoretical results. First we will construct a simple example where 
all nontrivial eigenvalues A2, A3, . . . lie inside the range of the degree function. 



Example 2 [A2 ^ rg(d) violated]. Consider the data space X = [1,2] c 
M and the probability distribution given by a piecewise constant probability 
density function p on X with p{x) = s if 4/3 < x < 5/3 and p{x) = (3 — 
s)/2 otherwise, for some fixed constant s G [0,3] (for example, for s = 0.3, 
this density has two clearly separated high density regions). As similarity 
function, we choose k{x,y) :=xy. Then the only eigenvalue of U outside of 
Tg{d) is the trivial eigenvalue with multiplicity one. 
Proof. In this example, it is straightforward to verify that the degree
function is given as $d(x) = 1.5x$ (independently of $s$) and has range
$[1.5, 3]$ on $\mathcal{X}$. A function $f \in C(\mathcal{X})$ is an
eigenfunction with eigenvalue $\lambda \notin \operatorname{rg}(d)$ of $U$ if
the eigenvalue equation is satisfied:

(3)    $Uf(x) = d(x) f(x) - x \int y f(y) p(y)\,dy = \lambda f(x).$

Defining the real number $\beta := \int y f(y) p(y)\,dy$, we can solve
equation (3) for $f(x)$ to obtain $f(x) = \frac{\beta x}{d(x) - \lambda}$.
Plugging this into the definition of $\beta$ yields the condition

(4)    $1 = \int \frac{y^2}{d(y) - \lambda}\, p(y)\,dy.$

Hence, $\lambda$ is an eigenvalue of $U$ if equation (4) is satisfied. For our
simple density function $p$, the integral in this condition can be solved
analytically. It can then be seen that
$g(\lambda) := \int_1^2 \frac{y^2}{d(y) - \lambda}\, p(y)\,dy = 1$ is only
satisfied for $\lambda = 0$, hence, the only eigenvalue outside of
$\operatorname{rg}(d)$ is the trivial eigenvalue 0 with multiplicity one. □
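The claim about $g$ can also be checked numerically. The following is a minimal sketch, not part of the original proof: it evaluates $g(\lambda)$ with a midpoint rule for the density of Example 2, assuming $s = 0.3$. The function names `density` and `g` and the grid size are our own choices. Since $g'(\lambda) = \int y^2/(d(y)-\lambda)^2\, p(y)\,dy > 0$, the function $g$ is increasing for $\lambda < 1.5$, and for $\lambda > 3$ the integrand is negative, so one expects $g(\lambda) = 1$ only at $\lambda = 0$.

```python
import numpy as np

def density(y, s):
    """Piecewise constant density of Example 2 on [1, 2]."""
    middle = (y > 4 / 3) & (y < 5 / 3)
    return np.where(middle, s, (3 - s) / 2)

def g(lam, s=0.3, cells=3000):
    """Midpoint-rule value of g(lam) = int_1^2 y^2 / (1.5 y - lam) p(y) dy.

    `cells` should be a multiple of 3 so that the jumps of p at 4/3 and 5/3
    fall on cell boundaries. Only meaningful for lam outside rg(d) = [1.5, 3],
    where the integrand has no singularity.
    """
    h = 1.0 / cells
    y = 1.0 + h * (np.arange(cells) + 0.5)  # cell midpoints in [1, 2]
    return float(np.sum(y**2 / (1.5 * y - lam) * density(y, s)) * h)

if __name__ == "__main__":
    # Values below the range [1.5, 3] and above it.
    for lam in (-2.0, -1.0, 0.0, 0.5, 1.0, 1.4, 3.1, 4.0, 10.0):
        print(f"lambda = {lam:5.1f}   g(lambda) = {g(lam): .4f}")
```

Scanning the printed values on either side of $\operatorname{rg}(d)$ shows where $g$ crosses 1; changing `s` does not move the crossing, in line with the degree function being independent of $s$.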

In the above example we can see that there indeed exist situations where
the operator $U$ does not possess a nonzero eigenvalue with
$\lambda \notin \operatorname{rg}(d)$. The next question is what happens in
this situation.

Proposition 24 [Clustering fails if $\lambda_2 \notin \operatorname{rg}(d)$
is violated]. Assume that $\sigma(U) = \{0\} \cup \operatorname{rg}(d)$ with
the eigenvalue 0 having multiplicity 1, and that the probability distribution
$P$ on $\mathcal{X}$ has no point masses. Then the sequence of second
eigenvalues of $\frac{1}{n} L_n$ converges to $\min_{x \in \mathcal{X}} d(x)$.
The corresponding eigenfunction will approximate the characteristic function
of some $x \in \mathcal{X}$ with $d(x) = \min_{x \in \mathcal{X}} d(x)$ or a
linear combination of such functions.

Proof. It is a standard fact (Chatelin [13]) that for each $\lambda$ inside
the continuous spectrum $\operatorname{rg}(d)$ of $U$ there exists a sequence
of functions $(f_n)_n$ with $\|f_n\| = 1$ such that
$\|(U - \lambda I) f_n\| \to 0$. Hence, for each precision $\varepsilon > 0$,
there exists a function $f_\varepsilon$ such that
$\|(U - \lambda I) f_\varepsilon\| < \varepsilon$. This means that for a
computer with machine precision $\varepsilon$, the function $f_\varepsilon$
appears to be an eigenfunction with eigenvalue $\lambda$. Thus, with a finite
precision calculation, we cannot distinguish between eigenvalues and the
continuous spectrum of an operator. A similar statement is true for the
eigenvalues of the empirical approximation $U_n$ of $U$. To make this precise,
we consider a sequence $(f_n)_n$ as follows. For given
$\lambda \in \operatorname{rg}(d)$, we choose some
$x_\lambda \in \mathcal{X}$ with $d(x_\lambda) = \lambda$. Define
$B_n := B(x_\lambda, \frac{1}{n})$ as the ball around $x_\lambda$ with radius
$1/n$ (note that $B_n$ does not depend on the sample), and choose some
$f_n \in C(\mathcal{X})$ which is constant 1 on $B_n$ and constant 0 outside
$B_{n-1}$. It can be verified by straightforward arguments that this sequence
has the property that for each machine precision $\varepsilon$ there exists
some $N \in \mathbb{N}$ such that, for $n > N$, we have
$\|(U_n - \lambda I) f_n\| < \varepsilon$ a.s. By Proposition 8 we can
conclude that

    $\left\| \left( \frac{1}{n} L_n - \lambda I \right) (f(X_1), \ldots, f(X_n))' \right\| < \varepsilon$  a.s.

Consequently, if the machine precision of the numerical eigensolver is
$\varepsilon$, then this expression cannot be distinguished from 0, and the
vector $(f(X_1), \ldots, f(X_n))'$ appears to be an eigenvector of
$\frac{1}{n} L_n$ with eigenvalue $\lambda$. As this construction holds for
each $\lambda \in \operatorname{rg}(d)$, the smallest nonzero "eigenvalue"
discovered by the eigensolver will be
$\lambda_2 := \min_{x \in \mathcal{X}} d(x)$. If $x_{\lambda_2}$ is the unique
point in $\mathcal{X}$ with $d(x_{\lambda_2}) = \lambda_2$, then the second
eigenvector of $\frac{1}{n} L_n$ will converge to the delta-function at
$x_{\lambda_2}$. If there are several points $x \in \mathcal{X}$ with
$d(x) = \lambda_2$, then the "eigenspace" of $\lambda_2$ will be spanned by the
delta-functions at all those points. In this case, the eigenvectors of
$\frac{1}{n} L_n$ will approximate one of those delta-functions, or a linear
combination thereof. □

As a side remark, note that as the above construction holds for all
elements $\lambda \in \operatorname{rg}(d)$, eventually the whole interval
$\operatorname{rg}(d)$ will be populated by eigenvalues of $\frac{1}{n} L_n$.
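The behavior described above can be observed on samples from Example 2. The sketch below is an illustration under our own assumptions ($s = 0.3$, sample sizes and the helper names `sample_example2` and `unnormalized_spectrum` are not from the text). It draws points from the piecewise constant density, forms $\frac{1}{n} L_n$ for $k(x,y) = xy$, and compares its eigenvalues with the empirical degree range $[\min_i d_i/n, \max_i d_i/n]$. For this particular rank-one similarity, eigenvalue interlacing for rank-one perturbations of a diagonal matrix already implies at finite $n$ that every nontrivial eigenvalue lies in that interval.

```python
import numpy as np

def sample_example2(n, s=0.3, rng=None):
    """Draw n points from the piecewise constant density of Example 2."""
    rng = np.random.default_rng(rng)
    edges = np.array([1.0, 4 / 3, 5 / 3, 2.0])
    # Probability mass of the three thirds of [1, 2] under p.
    probs = np.array([(3 - s) / 6, s / 3, (3 - s) / 6])
    piece = rng.choice(3, size=n, p=probs)
    return rng.uniform(edges[piece], edges[piece + 1])

def unnormalized_spectrum(x):
    """Eigen-decomposition of (1/n) L_n for the similarity k(x, y) = x * y."""
    n = len(x)
    K = np.outer(x, x)
    degrees = K.sum(axis=1)
    L = np.diag(degrees) - K
    eigvals, eigvecs = np.linalg.eigh(L / n)  # eigenvalues in ascending order
    return eigvals, eigvecs, degrees / n

if __name__ == "__main__":
    for n in (50, 200, 800):
        x = sample_example2(n, rng=0)
        eigvals, eigvecs, d_n = unnormalized_spectrum(x)
        second = eigvecs[:, 1]
        print(f"n = {n:4d}  min d_i/n = {d_n.min():.4f}  "
              f"lambda_2 = {eigvals[1]:.4f}  "
              f"largest squared entry of eigenvector 2 = {np.max(second**2):.3f}")
```

The last printed quantity is a crude indicator of how concentrated the second eigenvector is on a single sample point; it is our own heuristic rather than a quantity used in the proof.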

So far we have seen that there exist examples where the assumption
$\lambda \notin \operatorname{rg}(d)$ in Theorem 21 is violated, and that in
this case the corresponding eigenfunction does not contain any useful
information for clustering. This situation is aggravated by the fact that the
condition $\lambda \notin \operatorname{rg}(d)$ can only be verified if the
operator $U$, and hence, the probability distribution $P$ on $\mathcal{X}$,
is known. As this is not the case in the standard setting of clustering, it is
impossible to know whether the condition
$\lambda \notin \operatorname{rg}(d)$ is true for the eigenvalues in
consideration or not. Consequently, not only can spectral clustering fail
in certain situations, but we are also unable to check whether this is the
case for a given application of clustering. The least one should do if one
really wants to use unnormalized spectral clustering is to estimate the
critical region $\operatorname{rg}(d)$ by
$[\min_i d_i/n, \max_i d_i/n]$ and check whether the relevant eigenvalues of
$\frac{1}{n} L_n$ are inside or close to this interval or not. This
observation then gives an indication whether the results obtained can be
considered reliable or not.
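The check suggested above is easy to automate. The following sketch shows one possible form of it; the function name `flag_unreliable_eigenvalues` and the parameter `rel_margin`, which decides what counts as "close to" the interval, are hypothetical choices of ours and not prescribed by the analysis.

```python
import numpy as np

def flag_unreliable_eigenvalues(K, num_eigs=5, rel_margin=0.1):
    """Flag eigenvalues of (1/n) L_n lying in or near [min d_i/n, max d_i/n].

    K is a symmetric similarity matrix. `rel_margin` (an arbitrary choice)
    widens the estimated interval by that fraction of its length on each side.
    Returns (eigenvalues, flags, (lo, hi)) for the num_eigs smallest eigenvalues.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    degrees = K.sum(axis=1)
    lo, hi = degrees.min() / n, degrees.max() / n
    L = np.diag(degrees) - K
    eigvals = np.linalg.eigvalsh(L / n)[:num_eigs]  # ascending order
    margin = rel_margin * (hi - lo)
    flags = (eigvals >= lo - margin) & (eigvals <= hi + margin)
    return eigvals, flags, (lo, hi)

if __name__ == "__main__":
    # Two tight, well separated groups on the real line with a Gaussian
    # similarity of width 1 (toy data chosen for illustration only).
    x = np.concatenate([np.linspace(0.0, 0.4, 5), np.linspace(10.0, 10.4, 5)])
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)
    eigvals, flags, (lo, hi) = flag_unreliable_eigenvalues(K, num_eigs=4)
    print(f"estimated rg(d): [{lo:.4f}, {hi:.4f}]")
    for lam, bad in zip(eigvals, flags):
        print(f"eigenvalue {lam: .4f}  inside or close to range: {bad}")
```

An eigenvector whose eigenvalue is flagged should be treated with suspicion before it is used for clustering. The estimated interval is itself only an approximation of $\operatorname{rg}(d)$, so an unflagged eigenvalue is an indication of reliability, not a guarantee.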

Finally, we want to show that such problems as described above do not 
only occur in pathological examples, but they can come up for many simi- 
larity functions which are often used in practice. 

Proposition 25 (Finite discrete spectrum for analytic similarity).
Assume that $\mathcal{X}$ is a compact subset of $\mathbb{R}^n$, and the
similarity function $k$ is analytic in a neighborhood of
$\mathcal{X} \times \mathcal{X}$. Let $P$ be a probability distribution
on $\mathcal{X}$ which has an analytic density function. Assume that the set
$\{x^* \in \mathcal{X} \mid d(x^*) = \min_{x \in \mathcal{X}} d(x)\}$ is
finite. Then $\sigma(U)$ has only finitely many eigenvalues outside
$\operatorname{rg}(d)$.

Fig. 1. Eigenvalues and eigenvectors of the unnormalized Laplacian.
Eigenvalues within $\operatorname{rg}(d_n)$ and the trivial first eigenvalue
are plotted as stars, the "informative" eigenvalues below
$\operatorname{rg}(d_n)$ are plotted as diamonds. The dashed line indicates
$\min d_n(x)$. The parameters are $\sigma = 1$ (first row), $\sigma = 2$
(second row), $\sigma = 5$ (third row), and $\sigma = 50$ (fourth row).

This proposition is a special case of results on the discrete spectrum of the 
generalized Friedrichs model which can be found, for example, in Lakaev [32], 
Abdullaev and Lakaev [1] and Ikromov and Sharipov [26]. In those articles, 
the authors only consider the case where P is the uniform distribution, but 
their proofs can be carried over to the case of analytic density functions. 



8.2. Empirical results. To illustrate what happens for unnormalized
spectral clustering if the condition $\lambda \notin \operatorname{rg}(d)$ is
violated, we want to analyze empirical examples and compare the eigenvectors
of unnormalized and normalized graph Laplacians. Our goal is to show that
problems can occur in examples which are highly relevant to practical
applications. As data space, we choose $\mathcal{X} = \mathbb{R}$ with a
density which is a mixture of four Gaussians with means 2, 4, 6 and 8, and the
same standard deviation 0.25. This density consists of four very well
separated clusters, and it is so simple that every clustering algorithm should
be able to identify the clusters. As similarity function we choose the
Gaussian kernel function $k(x,y) = \exp(-\|x - y\|^2/\sigma^2)$, which is
the similarity function most widely used in applications of spectral
clustering. It is difficult to prove analytically how many eigenvalues will
lie below $\operatorname{rg}(d)$; by Proposition 25, we only know that they
are finitely many. However, in practice, it turns out that "finitely many"
often means "very few," for example, two or three.

Fig. 2. Eigenvalues and vectors of the normalized Laplacian for $\sigma = 1$,
$\sigma = 5$ and $\sigma = 50$. [Panels in each row: Eigenvalues,
Eigenvector 2, Eigenvector 3, Eigenvector 4, Eigenvector 5.]

In Figures 1 and 2 we show the eigenvalues and eigenvectors of the
normalized and unnormalized Laplacians, for different values of the kernel
width parameter $\sigma$. To obtain those plots, we drew 200 data points at
random from the mixture of Gaussians, computed the graph Laplacians based on
the Gaussian kernel function, and computed their eigenvalues and
eigenvectors. In the unnormalized case we show the eigenvalues and vectors of
$L_n$, in the normalized case those of the matrix $L'_n$. In each case we then
plot the first 10 eigenvalues ordered by size (i.e., we plot $i$ vs.
$\lambda_i$), and the eigenvectors as functions on the data space (i.e., we
plot $X_i$ vs. $v_i$). In Figure 1 we show the behavior of the unnormalized
graph Laplacian for various values of $\sigma$. We can observe that the larger
the value of $\sigma$ is, the more the eigenvalues move toward the range of the
degree function. For eigenvalues which are safely below this range, the
corresponding eigenvectors are nontrivial, and thresholding them at 0 leads to
a correct split between different clusters in the data (recall that the
clusters are centered around 2, 4, 6 and 8). For example, in case of the plots
in the first row of Figure 1, thresholding Eigenvector 2 at 0 separates the
first two from the second two clusters, thresholding Eigenvector 3 separates
clusters 1 and 4 from the clusters 2 and 3, and Eigenvector 4 separates
clusters 1 and 3 from clusters 2 and 4. However, for eigenvalues which are
very close to or inside $\operatorname{rg}(d_n)$, the corresponding
eigenvector is close to a Dirac vector. In Figure 2 we show eigenvalues and
eigenvectors of the normalized Laplacian. We can see that, for all values of
$\sigma$, all eigenvectors are informative about the clustering, and no
eigenvector has the form of a Dirac function. This is even the case for
extreme values such as $\sigma = 50$.
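An experiment of this kind can be set up in a few lines. The sketch below is not the code used for Figures 1 and 2; it is a reconstruction under stated assumptions: equal mixture weights, the symmetric normalization $D^{-1/2} L_n D^{-1/2}$ for the normalized Laplacian, the unnormalized matrix scaled by $1/n$ so that its eigenvalues are comparable to $d_i/n$, and a fixed random seed. The helper `peak_mass` is our own crude indicator of Dirac-like eigenvectors (largest squared entry of a unit-norm vector).

```python
import numpy as np

def sample_mixture(n=200, means=(2.0, 4.0, 6.0, 8.0), std=0.25, seed=0):
    """Sample from a mixture of Gaussians with equal weights (assumed)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(means, size=n)
    return np.sort(rng.normal(centers, std))  # sorted for easier plotting

def laplacian_spectra(x, sigma):
    """Spectra of (1/n) L_n and of the symmetrically normalized Laplacian."""
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / sigma**2)
    d = K.sum(axis=1)
    L = np.diag(d) - K
    inv_sqrt_d = 1.0 / np.sqrt(d)
    L_norm = inv_sqrt_d[:, None] * L * inv_sqrt_d[None, :]
    un_vals, un_vecs = np.linalg.eigh(L / n)
    no_vals, no_vecs = np.linalg.eigh(L_norm)
    return (un_vals, un_vecs, d / n), (no_vals, no_vecs)

def peak_mass(vecs, k=5):
    """Largest squared entry of each of the first k unit-norm eigenvectors.

    Values near 1 indicate a vector concentrated on a single data point.
    """
    return np.max(vecs[:, :k] ** 2, axis=0)

if __name__ == "__main__":
    x = sample_mixture()
    for sigma in (1.0, 2.0, 5.0, 50.0):
        (un_vals, un_vecs, d_n), (no_vals, no_vecs) = laplacian_spectra(x, sigma)
        below = int(np.sum(un_vals[:10] < d_n.min()))
        print(f"sigma = {sigma:4.0f}: {below} of the first 10 unnormalized "
              f"eigenvalues lie below min d_i/n")
        print("  peak mass, unnormalized:", np.round(peak_mass(un_vecs), 3))
        print("  peak mass, normalized:  ", np.round(peak_mass(no_vecs), 3))
```

Plotting `x` against the columns of `un_vecs` and `no_vecs` gives pictures of the same type as the eigenvector panels of the two figures; the exact values depend on the random sample.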

9. Conclusion. In this article we investigated the consistency of spec- 
tral clustering algorithms by studying the convergence of eigenvectors of 
the normalized and unnormalized Laplacian matrices on random samples. 
We proved that, under standard assumptions, the first eigenvectors of the
normalized Laplacian converge to eigenfunctions of some limit operator. In
the unnormalized case, the same is only true if the eigenvalues of the limit 
operator satisfy certain properties, namely, if these eigenvalues lie below the 
continuous part of the spectrum. We showed that in many examples this 
condition is not satisfied. In those cases, the information provided by the 
corresponding eigenvector is misleading and cannot be used for clustering. 

This leads to two main practical conclusions about spectral clustering. 
First, from a statistical point of view, it is clear that normalized rather than 
unnormalized spectral clustering should be used whenever possible. Second, 
if for some reason one wants to use unnormalized spectral clustering, one 
should try to check whether the eigenvalues corresponding to the eigenvec- 
tors used by the algorithm lie significantly below the continuous part of the 
spectrum. If that is not the case, those eigenvectors need to be discarded, 
as they do not provide information about the clustering. 